Search Results: "Johannes Schauer"

5 June 2013

Johannes Schauer: Reliable IMAP synchronization with IDLE support

The task: reliably synchronize my local MailDir with several remote IMAP mailboxes, with IDLE support so that there is no need to poll at short intervals to get new email immediately. Most graphical mail clients like icedove/thunderbird or evolution have IDLE support, which means that their users get notified about new email as soon as it arrives. I prefer to handle email fetching/reading/sending with separate programs, so after having used icedove for a long time, I switched to an offlineimap/mutt based setup a few years ago. A while after that I discovered sup and later notmuch/alot, which made me switch again. Now my only remaining problem is the synchronization between my local email and several remote IMAP servers.

Using offlineimap worked fine in the beginning, but soon I discovered some of its shortcomings. It would, for example, sometimes lock up or simply crash when I switched between different wireless networks or switched to ethernet or UMTS. Crashing was not a big problem as I just put it into a script which re-executed it every time it crashed. The problem was it locking up for one of the email accounts it synchronized while the others were kept in sync as usual. This once led to me missing five days of email because offlineimap was not crashing and I believed everything was fine (new messages from the other accounts were scrolling by as usual), while people were sending me worried emails asking whether I was okay or whether something bad had happened. I nearly missed a paper submission deadline and another administrative deadline due to this. This was insanely annoying. It turned out that other mail synchronization programs suffered from the same lockup problem, so I stuck with offlineimap and instead executed it as:
while true; do
    timeout --signal=KILL 1m offlineimap
    notmuch new
    sleep 5m
done
This would synchronize my email every five minutes and kill offlineimap if a synchronization took more than one minute. While I would've liked an email synchronizer which did not need this measure, it worked fine over the months. After a while it bugged me that icedove (which my girlfriend is using) received email nearly instantaneously after it arrived and I couldn't have this feature with my setup. This instantaneous arrival of email works via the IMAP IDLE command, which allows the server to notify the client once new email arrives instead of the client having to poll for new email at short intervals. Unfortunately offlineimap (and every other email synchronizer I found) does not support the IDLE command. There is a fork of it which supports IDLE by using a newer python imap library, but this is of little use to me as it offers no hook which executes when new email arrives, so I cannot run notmuch new on the new email. At this point I could've used inotify to execute notmuch new upon arrival of new email, but I went another way. Here is a short python script idle.py which connects to my IMAP servers and sends the IDLE command:
import imaplib
import select
import sys

class mysock():
    def __init__(self, name, server, user, passwd, directory):
        self.name = name
        self.directory = directory
        self.M = imaplib.IMAP4(server)
        self.M.login(user, passwd)
        self.M.select(self.directory)
        self.M.send("%s IDLE\r\n" % (self.M._new_tag()))
        if not self.M.readline().startswith('+'):  # expect continuation response
            exit(1)

    def fileno(self):
        return self.M.socket().fileno()

if __name__ == '__main__':
    sockets = [
        mysock("Name1", "imap.foo.bar", "user1", "pass1", "folder1"),
        mysock("Name1", "imap.foo.bar", "user1", "pass1", "folder2"),
        mysock("Name2", "imap.blub.bla", "user2", "pass2", "folder3")
    ]
    readable, _, _ = select.select(sockets, [], [], 1980)  # 33 mins timeout

    found = False
    for sock in readable:
        if sock.M.readline().startswith('* BYE '):
            continue
        print "-u basic -o -q -f %s -a %s" % (sock.directory, sock.name)
        found = True
        break

    if not found:
        print "-u basic -o"
It creates a connection to every directory I want to watch on every email account I want to synchronize. The select call expires after 33 minutes, as most email servers drop the connection after around 30 minutes of inactivity. If something happened within that time, the script outputs the arguments for offlineimap to do a quick check of just that mailbox on that account. If nothing happened, the script outputs the arguments for offlineimap to do a full check of all my mailboxes.
while true; do
    if ! args=$(timeout --signal=KILL 35m python idle.py); then
        echo "idle timed out" >&2
        args="-u basic -o"
    fi

    echo "call offlineimap with: $args" >&2
    if ! timeout --signal=KILL 1m offlineimap $args; then
        echo "offlineimap timed out" >&2
        continue
    fi

    notmuch new
done
Every call which interacts with the network is wrapped in a timeout command to avoid any funny effects. Should the python script time out, a full synchronization with offlineimap is triggered. Should offlineimap time out, an error message is written to stderr and the script continues. The above naturally has the disadvantage of not immediately responding to new email which arrives during the time that idle.py is not running. But as this email will be fetched once the next message arrives on the same account, there is not much waiting time and so far this problem hasn't bitten me. Is there a better way to synchronize my email and at the same time make use of IDLE? I'm surprised I didn't find software which offers this feature.
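For completeness, the inotify route mentioned above could look roughly like the following sketch. It uses pyinotify and a placeholder maildir path; this is the road I did not take, not part of my actual setup:
import subprocess
import pyinotify

MAILDIR = "/home/user/Maildir"  # placeholder path

class NewMail(pyinotify.ProcessEvent):
    def process_IN_MOVED_TO(self, event):
        # maildir delivery moves finished messages into the mailbox
        subprocess.call(["notmuch", "new"])

    # treat newly created files the same way
    process_IN_CREATE = process_IN_MOVED_TO

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, NewMail())
wm.add_watch(MAILDIR, pyinotify.IN_CREATE | pyinotify.IN_MOVED_TO, rec=True)
notifier.loop()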

28 May 2013

Johannes Schauer: New bootstrap analysis results

I wrote a small wiki page for botch. It includes links to some hopefully useful resources like my FOSDEM 2013 talk or my master thesis about botch. The final goal of botch is to generate a build order with which Debian can be bootstrapped from zero to more than 18000 source packages. But such a build order can only be generated once enough source packages (by now the number is around 70) incorporate build profiles which allow compiling them with fewer build dependencies and thereby break all build dependency cycles. So an intermediate goal of botch is to find a "good" (i.e. close to minimal) selection of build dependencies that should be dropped from their source packages.

So far, the results of all heuristics we use for this task were just dumped to standard output. Since this textual format was hard for a human developer to read and even harder to process by machine, that tool now outputs its results in JSON. This data is then converted by another tool to a more human readable format. One such format is HTML, and with the help of javascript for sortable and pageable tables, the data can then be presented without producing too much visual clutter (at least compared to the initial textual format). Another advantage of HTML is that the data can now easily be shared with other developers without them having to run botch or install any other program first. So without further ado, please find the result of running our heuristics on Debian Sid as of 2013-01-01 here: http://mister-muffin.de/bootstrap/stats/ Here is a screenshot of what you can expect (you can see part of the table of edges with most cycles through them): As a developer uses this information to add build profiles to source packages, this HTML page can be regenerated and will show which strongly connected components are now left in the graph and how to best make them acyclic as well.

Besides other improvements, I also updated the data I used to generate the graphs from this post. I found this update to be necessary since we found and fixed several implementation bugs since October 2012, so this new plot should be more accurate. Here is a graph which shows the number of vertices in the biggest strongly connected component of Debian Sid by date. Some datapoints are missing for dates for which important source packages in Sid were not compilable.

16 April 2013

Johannes Schauer: botch - the debian bootstrap software has a name

After nearly one year of development, the "Port bootstrap build-ordering tool" now finally has a name: botch. It stands for Bootstrap/Build Order Tool CHain. Now we don't have to call it "Port bootstrap build-ordering tool" anymore (nobody did that anyway) and also don't have to refer to it as "the tools" in email, IRC or in my master thesis, which is due in a few weeks. With this, the issue also doesn't block the creation of a Debian package anymore. Since only a handful of people have a clone of it anyway, I also renamed the gitorious git repository url (and updated all links in this blog and left a text file informing about the name change in the old location). The new url is: https://gitorious.org/debian-bootstrap/botch Further improvements since my last post in January include: In the meantime, a paper about the theoretical aspects of this topic which Pietro Abate and I submitted to the CBSE 2013 conference got accepted (hurray!) and I will travel to the conference in Canada in June. Botch (yeey, I can call it by a name!) is also an integral part of one of the proposals for a Debian Google Summer of Code project this year, mentored by Wookey and co-mentored by myself. Let's hope it gets accepted and produces good results at the end of the summer!

1 February 2013

Johannes Schauer: Bootstrappable Debian FOSDEM talk

I will give a talk about the current status of automating Debian bootstrapping at FOSDEM 2013. There will also be a live demo of the developed toolchain. The talk will start at 16:30 on Saturday, 2 February 2013 in room H.1302 as part of the Cross distro devroom track and will be half an hour long. Directly after my talk, at 17:00, Wookey will give his talk about bootstrapping Debian/Ubuntu for arm64 in the same location. The current version of the slides can be downloaded here. The latex beamer source is available in the git repository of the debian bootstrap project. If you can't make it to FOSDEM, then my last post about the topic gives away most of the things I will be talking about.

25 January 2013

Johannes Schauer: Bootstrappable Debian - New Milestone

This post is about the port bootstrap build ordering tool (naming suggestions welcome) which was started as a Debian Google Summer of Code project in 2012 and continued to be developed afterwards. Sources are available through gitorious. At the end of November 2012, I managed to put down an approximation algorithm for the feedback arc set problem which allowed breaking the dependency graph into a directed acyclic graph with only a few removed build dependencies. I wrote about this effort on our mailing list but didn't mention it here because it was still too much of a proof-of-concept. Later, in January 2013, I mentioned the result of this algorithm in an email Wookey and I wrote to the debian-devel mailing list. Many things happened since November 2012:

Processing pipeline instead of monolithic tools The tools I developed so far tried to accomplish everything by themselves, reusing functionality implemented in a central library. Therefore, if one wanted to try out even trivial new things, it mostly meant hacking some OCaml code. Pietro Abate suggested instead developing smaller tools which could work independently of each other, would each execute only one algorithm and could easily be connected together in different ways to achieve different effects. This switch is now done and all functionality from the old tools has moved into a new toolset. The exchange format between the tools is either plain text files in deb822 control format (Packages/Sources files) or a dependency graph. The dependency graph is currently marshalled by OCaml but future versions should work with just passing a GraphML (an XML graph format) representation around. This new way of doing things seems close to the UNIX philosophy (each program does one thing well, data is stored as text, every program is a filter). For example the deb822 control output can easily be manipulated using grep-dctrl(1) and there exist many tools which can read, analyze and manipulate GraphML. It is a big improvement over the old, monolithic tools which did not allow manipulating any intermediate result with external, existing tools.

Currently, a shell script (native.sh) will execute all tools in a meaningful order, connecting them together correctly. The same tools will be used for a future cross.sh but they will be connected differently. I wrote about a first proposal of what the individual tools should do and how they should be connected in this email, from which I also linked a confusing overview of the pipeline. This overview has recently been improved to be even more confusing and the current version of the lower half (the native part) now looks like this (figure: "native reduced"): Solid arrows represent a flow of binary packages, dotted arrows represent a flow of source packages, ovals represent a set of packages, boxes with rounded corners represent set operations, rectangular boxes represent filters. There is only one input to a filter, which is the arrow connected to the top of the box. Outgoing arrows from the bottom represent the filtered input. Incoming arrows to either side are arguments to the filter and control how the filter behaves depending on the algorithm. I will explain this better once the pipeline is less in flux. The pipeline is currently executed like this by native.sh.
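As an illustration of the benefit of a standard exchange format: once the dependency graph is available as GraphML, off-the-shelf tooling can inspect it without touching any OCaml. Here is a minimal Python sketch using networkx (the file name is hypothetical; this is not part of botch itself):
import networkx as nx

# Load a dependency graph that was exported as GraphML (hypothetical file name).
g = nx.read_graphml("dependency-graph.graphml")

# List the strongly connected components, biggest first.
sccs = sorted(nx.strongly_connected_components(g), key=len, reverse=True)
print("largest SCC has %d nodes" % len(sccs[0]))
for node in sorted(sccs[0]):
    print(node)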

Two new ways to break dependency cycles have been discovered So far, we knew of three ways to formally break dependency cycles:
  • remove build dependencies through build profiles
  • find out that the build dependency is only used to build arch:all packages and therefore put it into Build-Depends-Indep
  • cross compile some source packages
Two new methods can be added to the above:

Dependency graph definition changed Back in September when I was visiting IRILL, Pietro found a flaw in how the dependency graph used to be generated. He supplied a new definition of the dependency graph which does away with the problem he found. After fixing some small issues with his code, I changed all the existing algorithms to use the new graph definition. The old graph is as of today removed from the repository. Thanks to Pietro for supplying the new graph definition - I must still admit that my OCaml-fu is not strong enough to have come up with his code.

Added complexity for profile built source packages As mentioned in the introduction, Wookey and I addressed the debian-devel list with a proposal on necessary changes for an automated bootstrapping of Debian. During the discussion, two important things came up which are to be considered by the dependency graph algorithms:
  • profile built source packages may not create all binary packages
  • profile built source packages may need additional build dependencies
Both things make it necessary to alter the dependency graph during the generation of a feedback arc set and feedback vertex set beyond the simple removal of edges. Luckily, the developed approximation algorithms can be extended to support such changes in the graph.

Different feedback arc set algorithms The initially developed feedback arc set algorithm is well suited to discover build dependency edges which should be dropped. It performs far worse when creating the final build order because it only considers edges individually and not how many edges of a source package can be dropped by profile building it. The adjusted algorithm for generating a build order is more of a feedback vertex set algorithm because instead of greedily finding the edge with most cycles through it, it greedily finds the source package which would break most cycles if it was profile built.
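The vertex-greedy idea can be sketched in a few lines (a simplified Python sketch over a precomputed list of cycles; botch's actual implementation is OCaml and works on the full dependency graph, so the function name and data layout here are made up for illustration):
def greedy_feedback_vertex_set(cycles):
    # cycles: list of sets of vertices, one set per elementary cycle.
    # Returns vertices (source packages) to profile build so that every cycle is broken.
    remaining = [set(c) for c in cycles]
    chosen = []
    while remaining:
        # Count in how many still-unbroken cycles each vertex occurs.
        counts = {}
        for cycle in remaining:
            for v in cycle:
                counts[v] = counts.get(v, 0) + 1
        # Greedily pick the vertex that breaks the most cycles.
        best = max(counts, key=counts.get)
        chosen.append(best)
        remaining = [c for c in remaining if best not in c]
    return chosen

# Toy example: two cycles sharing the vertex "src:foo".
print(greedy_feedback_vertex_set([{"src:foo", "bin:bar"}, {"src:foo", "bin:baz", "src:qux"}]))
# -> ['src:foo']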

Generating a build order with fewer profile built source packages After implementing all the features above I now feel more confident about publishing the current status of the tools to a wider audience. The following test shows a run of the aforementioned ./native.sh shell script. Its final output is a list of source packages which have to be profile built and a build order. Using the resulting build order, starting from a minimal build system (essential:yes, build-essential and debhelper), all source packages will be compiled which are needed to compile all binary packages in the system. The result will therefore be a list of source and binary packages which fulfill the following property:
  • all binary packages can be built from the available source packages
  • all source packages can be built with the available binary packages
I called this a "reduced distribution" in earlier posts. The interesting property of this specific selection is that it contains the biggest problem set of Debian when bootstrapping it: a 900 to 1000 node big strongly connected component. Here is a visualization of the problem (figure: "hideous mess"). Source packages do not yet come with build profiles and the cross build situation cannot yet be analyzed, so the following assumptions were made: The last point is about 14 build dependencies which were decided to be broken by the feedback arc set algorithm but for which other data sources did not indicate that they are actually breakable. Those 14 are dynamically generated by native.sh.

If the above assumptions are not too far from the actual situation, then not more than 73 source packages have to be modified to bootstrap a reduced distribution. This reduced distribution even includes dependency-wise "big" packages like webkit, metacity, iceweasel, network-manager, tracker, gnome-panel, evolution-data-server, kde-runtime, libav and nautilus. By changing one line in native.sh one could easily produce a build order which generates gnome-desktop or really any given (meta-)package selection. All of native.sh takes only 80 seconds to execute on my system (Core i5, 2.5 GHz, singlethreaded). Here is the final build order which creates 2044 binary packages from 613 source packages:
  1. nspr, libio-pty-perl, libmcrypt, unzip, libdbi-perl, cdparanoia, libelf, c-ares, liblocale-gettext-perl, libibverbs, numactl, ilmbase, tbb, check, libogg, libatomic-ops, libnl($), orc($), libaio, tcl8.4, kmod, libgsm, lame, opencore-amr, tcl8.5, exuberant-ctags, mhash, libtext-iconv-perl, libutempter, pciutils, gperf, hspell, recode, tcp-wrappers, fdupes, chrpath, libbsd, zip, procps, wireless-tools, cpufrequtils, ed, libjpeg8, hesiod, pax, less, dietlibc, netkit-telnet, psmisc, docbook-to-man, libhtml-parser-perl, libonig, opensp($), libterm-size-perl, linux86, libxmltok, db-defaults, java-common, sharutils, libgpg-error, hardening-wrapper, cvsps, p11-kit, libyaml, diffstat, m4
  2. openexr, enca, help2man, speex, libvorbis, libid3tag, patch, openjade1.3, openjade, expat, fakeroot, libgcrypt11($), ustr, sysvinit, netcat, libirman, html2text, libmad, pth, clucene-core, libdaemon($), texinfo, popt, net-tools, tar, libsigsegv, gmp, patchutils($), dirac, cunit, bridge-utils, expect, libgc, nettle, elfutils, jade, bison
  3. sed, indent, findutils, fastjar, cpio, chicken, bzip2, aspell, realpath, dctrl-tools, rsync, ctdb, pkg-config($), libarchive, gpgme1.0, exempi, pump, re2c, klibc, gzip, gawk, flex-old, original-awk, mawk, libtasn1-3($), flex($), libtool
  4. libcap2, mksh, readline6, libcdio, libpipeline, libcroco, schroedinger, desktop-file-utils, eina, fribidi, libusb, binfmt-support, silgraphite2.0, atk1.0($), perl, gnutls26($), netcat-openbsd, ossp-uuid, gsl, libnfnetlink, sg3-utils, jbigkit, lua5.1, unixodbc, sqlite, wayland, radvd, open-iscsi, libpcap, linux-atm, gdbm, id3lib3.8.3, vo-aacenc, vo-amrwbenc, fam, faad2, hunspell, dpkg, tslib, libart-lgpl, libidl, dh-exec, giflib, openslp-dfsg, ppl($), xutils-dev, blcr, bc, time, libdatrie, libpthread-stubs, guile-1.8, libev, attr, libsigc++-2.0, pixman, libpng, libssh2, sqlite3, acpica-unix, acl, a52dec
  5. json-c, cloog-ppl, libverto, glibmm2.4, rtmpdump, libdbd-sqlite3-perl, nss, freetds, slang2, libpciaccess, iptables, e2fsprogs, libnetfilter-conntrack, bash, libthai, python2.6($), libice($), libpaper, libfontenc($), libxau, libxdmcp($), openldap($), cyrus-sasl2($), openssl, python2.7($)
  6. psutils, libevent, stunnel4, libnet-ssleay-perl, libffi, readline5, file, libvoikko, gamin, libieee1284, build-essential, libcap-ng($), lcms($), keyutils, libxml2, libxml++2.6, gcc-4.6($), binutils, dbus($), libsm($), libxslt, doxygen($)
  7. liblqr, rarian, xmlto, policykit-1($), libxml-parser-perl, tdb, devscripts, eet, libasyncns, libusbx, icu, linux, libmng, shadow, xmlstarlet, tidy, gavl, flac, dbus-glib($), libxcb, apr, krb5, alsa-lib
  8. usbutils, enchant, neon27, uw-imap, libsndfile, yasm, xcb-util, gconf($), shared-mime-info($), audiofile, ijs($), jbig2dec, libx11($)
  9. esound, freetype, libxkbfile, xvidcore, xcb-util-image, udev($), libxfixes, libxext($), libxt, libxrender
  10. tk8.4, startup-notification, libatasmart, fontconfig, libxp, libdmx, libvdpau, libdrm, libxres, directfb, libxv, libxxf86dga, libxxf86vm, pcsc-lite, libxss, libxcomposite, libxcursor, libxdamage, libxi($), libxinerama, libxrandr, libxmu($), libxpm, libxfont($)
  11. xfonts-utils, libxvmc, xauth, libxtst($), libxaw($)
  12. pmake, corosync, x11-xserver-utils, x11-xkb-utils, coreutils, xft, nas, cairo
  13. libedit, cairomm, tk8.5, openais, pango1.0($)
  14. ocaml, blt, ruby1.8, firebird2.5, heimdal, lvm2($)
  15. cvs($), python-stdlib-extensions, parted, llvm-2.9, ruby1.9.1, findlib, qt4-x11($)
  16. xen, mesa, audit($), avahi($)
  17. x11-utils, xorg-server, freeglut, libva
  18. jasper, tiff3, python3.2($)
  19. openjpeg, qt-assistant-compat, v4l-utils, qca2, jinja2, markupsafe, lcms2, sip4, imlib2, netpbm-free, cracklib2, cups($), postgresql-9.1
  20. py3cairo, pycairo, pam, libgnomecups, gobject-introspection($)
  21. gdk-pixbuf, libgnomeprint, gnome-menus, gsettings-desktop-schemas, pangomm, consolekit, vala-0.16($), colord($), atkmm1.6
  22. libgee, gtk+3.0($), gtk+2.0($)
  23. gtkmm2.4, poppler($), openssh, libglade2, libiodbc2, gcr($), libwmf, systemd($), gcj-4.7, java-atk-wrapper($)
  24. torque, ecj($), vala-0.14, gnome-keyring($), gcc-defaults, gnome-vfs($)
  25. libidn, libgnome-keyring($), openmpi
  26. mpi-defaults, dnsmasq, wget, lynx-cur, ghostscript, curl
  27. fftw3, gnupg, libquvi, xmlrpc-c, raptor, liboauth, groff, fftw, boost1.49, apt
  28. boost-defaults, libsamplerate, cmake, python-apt
  29. qjson, qtzeitgeist, libssh, qimageblitz, pkg-kde-tools, libical, dwarves-dfsg, automoc, attica, yajl, source-highlight, pygobject, pygobject-2, mysql-5.5($)
  30. libdbd-mysql-perl, polkit-qt-1, libdbusmenu-qt, raptor2, dbus-python, apr-util
  31. rasqal, serf, subversion($), apache2
  32. git, redland
  33. xz-utils, util-linux, rpm, man-db, make-dfsg, libvisual, cryptsetup, libgd2, gstreamer0.10
  34. mscgen, texlive-bin
  35. dvipng, luatex
  36. libconfig, transfig, augeas, blas, libcaca, autogen, libdbi, linuxdoc-tools, gdb, gpm
  37. ncurses, python-numpy($), rrdtool, w3m, iproute, gcc-4.7, libtheora($), gcc-4.4
  38. libraw1394, base-passwd, lm-sensors, netcf, eglibc, gst-plugins-base0.10
  39. libiec61883, qtwebkit, libvirt, libdc1394-22, net-snmp, jack-audio-connection-kit($), bluez
  40. redhat-cluster, gvfs($), pulseaudio($)
  41. phonon, libsdl1.2, openjdk-6($)
  42. phonon-backend-gstreamer($), gettext, libbluray, db, swi-prolog, qdbm, swig2.0($)
  43. highlight, libselinux, talloc, libhdate, libftdi, libplist, python-qt4, libprelude, libsemanage, php5
  44. samba, usbmuxd, libvpx, lirc, bsdmainutils, libiptcdata, libgtop2, libgsf, telepathy-glib, libwnck3, libnotify, libunique3, gnome-desktop3, gmime, glib2.0, json-glib, libgnomecanvas, libcanberra, orbit2, udisks, d-conf, libgusb
  45. libgnomeprintui, libimobiledevice, nautilus($), libbonobo, librsvg
  46. evas, wxwidgets2.8, upower, gnome-disk-utility, libgnome
  47. ecore, libbonoboui
  48. libgnomeui
  49. graphviz
  50. exiv2, libexif, lapack, soprano, libnl3, dbus-c++
  51. atlas, libffado, graphicsmagick, libgphoto2, network-manager
  52. pygtk, jackd2, sane-backends, djvulibre
  53. libav($), gpac($), ntrack, python-imaging, imagemagick, dia
  54. x264($), matplotlib, iceweasel
  55. strigi, opencv, libproxy($), ffms2
  56. kde4libs, frei0r, glib-networking
  57. kde-baseapps, kate, libsoup2.4
  58. geoclue, kde-runtime, totem-pl-parser, libgweather, librest, libgdata
  59. webkit
  60. zenity, gnome-online-accounts
  61. metacity, evolution-data-server
  62. gnome-panel
  63. tracker
The final recompilation of profile built source packages is omitted. Source packages marked with a ($) are selected to be profile built. All source packages listed in the same line can be built in parallel as they do not depend upon each other. This order looks convincing as it first compiles a multitude of source packages which lack no or only a few build dependencies. Later steps allow fewer source packages to be compiled in parallel. The number of needed build dependencies is highest for the source packages that are built last.

25 November 2012

Pietro Abate: Mini Debian Conf 2012 in Paris : Bootstrapping Debian for a new architecture

I just finished addressing the awesome Debian crowd at the Mini DebConf in Paris. My presentation was about a few challenges we have ahead of us to bootstrap Debian on a new architecture. Johannes Schauer and Wookey did a lot of work in the last few months, particularly focusing on Linaro/Ubuntu. After Wheezy I think it is important to catch up with their work and integrate it in Debian. The two main take-away messages from my presentation: This week we just reached an important milestone toward a fully automatic bootstrap procedure. Hopefully we are going to tell you more about this work during FOSDEM 2013. My slides are attached.
Attachment: debminiconf2012.pdf (271.56 KB)

15 October 2012

Pietro Abate: Improved dose-builddebcheck

The new release of dose, apart from a few bug fixes, ships a new and improved version of dose-builddebcheck (man page). All the improvements to dose-builddebcheck come from a set of patches submitted by Johannes Schauer in the context of the Bootstrap GSoC. We are still actively working on this. I invite you to read josh's recent blog post on the topic. For more background and discussion, you can have a look at the archives of this mailing list. Dose-builddebcheck (or buildcheck for short) is similar in intent to dose-debcheck but for source packages. Buildcheck allows you to check, just by looking at the Sources and Packages files, whether all the build dependencies can be satisfied and installed, all this in a matter of seconds (less than 30 seconds to test all the source packages in sid on my laptop). A simple example:
deb-buildcheck.native \
--deb-native-arch=amd64 \
tests/DebianPackages/Sid-amd64-Packages-050812.bz2 \
tests/DebianPackages/Sid-Sources-050812.bz2

native-architecture: amd64
background-packages: 55945
foreground-packages: 18175
broken-packages: 64

real 0m22.223s
user 0m21.937s
sys 0m0.204s
This will check, for all source packages in the Sources file, whether their build dependencies can be satisfied on amd64 given the Packages binary file. This is nice but nothing new. A script based on the old edos tools has been around for quite a long time. We want more! The new exciting feature brought by Johannes is the capability of checking source packages for cross-compilation:
dose-builddebcheck -f -e -s --checkonly picolisp \
--deb-native-arch=amd64 \
--deb-foreign-archs=armel \
--deb-host-arch=armel \
tests/DebianPackages/Sid-amd64-Packages-050812.bz2 \
tests/DebianPackages/Sid-armel-Packages-050812.bz2 \
tests/DebianPackages/Sid-Sources-050812.bz2

(I)Sources: Parsing Sources file tests/DebianPackages/Sid-Sources-single-version-050812.bz2...
(I)Format822: total packages 17723
(I)Format822: Merging repositories
(I)Packages: Parsing Packages file tests/DebianPackages/Sid-armel-Packages-050812.bz2...
(I)Packages: Parsing Packages file tests/DebianPackages/Sid-amd64-Packages-050812.bz2...
(I)Format822: total packages 58095

native-architecture: amd64
foreign-architecture: armel
host-architecture: armel
report:
-
package: src:picolisp
version: 3.1.0.7-1
architecture: kopensolaris-amd64,solaris-amd64,amd64,any-i386,any-armel,any-armeb,any-arm,any-avr32,any-hppa,any-m32r,any-m68k,any-mips,any-mipsel,any-powerpc,any-s390,any-sh3,any-sh3eb,any-sh4,any-sh4eb,any-sparc,any-armhf
source: picolisp (= 3.1.0.7-1)
status: ok

background-packages: 75818
foreground-packages: 17723
broken-packages: 0
This will check whether the source package 'picolisp' in the Sources file can be cross compiled for 'armel' on the native architecture 'amd64' given the list of binary packages in the Packages file. The generated report is, as for dose-debcheck, encoded in YAML and can easily be parsed using an off-the-shelf library. Apt is the canonical tool that can be used to check if a package can be cross compiled. Josh found a few discrepancies between dose and apt results. This was a very good test for both tools: Bug #683786 is a very interesting read... Dose 3.1.1 (the latest release) should land soon in experimental. Otherwise you can get it from our (new) project homepage.
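Since the report is plain YAML, extracting the broken packages from it needs only a few lines. A Python sketch assuming PyYAML and the YAML part of the output saved to report.yaml (the field names are taken from the example above; this is not shipped with dose):
import yaml

with open("report.yaml") as f:
    data = yaml.safe_load(f)

print("broken packages: %d" % data.get("broken-packages", 0))
for entry in data.get("report") or []:
    if entry.get("status") != "ok":
        print("%s %s is not buildable" % (entry["package"], entry["version"]))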

13 October 2012

Johannes Schauer: Does it become harder to bootstrap Debian?

My last post explained how I retrieved and corrected data from snapshot.debian.org so that dose3 was able to parse it. In this post I will cover some surprising results I found when using my tools on those Packages and Sources files from 2005 until today. For each pair of Packages and Sources files I did the following:
  1. created a reduced distribution
  2. calculated the dependency graph
I call a reduced distribution the smallest set of binary and source packages with the following properties: Creating a reduced distribution first greatly increases the execution speed of my algorithms as it reduces the number of binary and source packages by an order of magnitude while still preserving the dependency cycle situation of the core packages. In many cases, once the packages of a reduced distribution are available, all the rest of Debian can be compiled from them without any dependency cycles. As also mentioned in earlier posts, there is always one central, big strongly connected component (SCC) in the dependency graph. I am especially interested in how the size of the reduced distribution and the SCC change over time as both are an indication of: Let's look at the plots I did from the data I gathered. The gray data points indicate that at that point in time, one or more of the core source packages (the ones in the reduced distribution) in Debian Sid was not compilable. This means that the resulting values cannot be fully trusted. But as it is mostly only a single source package that doesn't compile, it doesn't influence the overall result much and therefore I included them anyway. Red and green data points represent a fully successful run. The only thing that I do not yet understand is what happened in 2007...

So while a potential porter in 2005 only had to look at a graph of 150 nodes, he now needs to solve a graph of nearly 1000 nodes. The number of edges in the dependency graph grew even more dramatically, from about 500 to over 8000 edges. While the dependency situation for Debian Sid in 2005 can easily be printed using xdot and visually solved, this is not possible anymore in 2012. While dependencies of only a few dozen source packages had to be dropped manually in 2005, now even dropping build dependencies from a few hundred source packages doesn't solve the dependency situation. So my assumption is that, due to a growing number of interdependencies between source and binary packages (as both gain more features), bootstrapping Debian for a new architecture becomes harder over time. Is this also the perceived subjective impression of people that ported Debian in the past?

If my assumption is correct, then there is a growing need for official support of droppable build dependencies (or "stage builds" or "profile builds") to break dependency cycles during the bootstrapping process. The work of a porter would be much easier if source packages already contained information about which build dependencies can be dropped (if so needed). In the best case, a machine could use those annotations to calculate a build order automatically. As one can see in the graph above, there are currently 370 source packages in the main SCC. This means that no more than this number of packages (but probably far fewer) have to be annotated to break the SCC into a directed acyclic graph. Discussion about what syntax to use to mark potentially droppable build dependencies currently happens in bug#661538 but should maybe be discussed by a wider audience. The currently favored solution was proposed in said bug report by Guillem Jover and is called "build profiles". It has the advantage that it is not only trivial to implement (a patch exists for dpkg and dose3 already supports them) but would also be useful for other purposes like embedded builds. The format is similar to how architecture restrictions for individual dependencies are specified but uses angle brackets:
Build-Depends: huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny
The work Patrick McDermott did for his GSoC project over the summer already uses the above syntax.
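To illustrate the semantics of the proposed syntax: a build dependency annotated with <!embedded !bootstrap> is dropped whenever the embedded or the bootstrap profile is active. The exact rules were still under discussion at the time, so the following Python sketch is just an illustration of that reading, not dpkg's or dose3's parser:
import re

def dependency_applies(dep, active_profiles):
    # dep is a single build dependency such as
    # 'huge (>= 1.0) [i386 arm] <!embedded !bootstrap>'.
    m = re.search(r"<([^>]*)>", dep)
    if not m:
        return True  # no profile restriction: always keep the dependency
    # Keep the dependency only if every term of the restriction holds:
    # "!foo" requires profile foo to be inactive, "foo" requires it to be active.
    for term in m.group(1).split():
        if term.startswith("!"):
            if term[1:] in active_profiles:
                return False
        elif term not in active_profiles:
            return False
    return True

print(dependency_applies("huge (>= 1.0) [i386 arm] <!embedded !bootstrap>", {"bootstrap"}))  # False
print(dependency_applies("tiny", {"bootstrap"}))  # True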

12 October 2012

Johannes Schauer: analyzing Packages and Sources from snapshot.debian.org

When I wanted to use my dependency graph analysis tools to analyze earlier states of Debian Sid, I naturally used snapshot.debian.org to retrieve the Packages and Sources files from which my tools retrieve the dependency information. The problem is that many of those Packages and Sources files contain syntax errors that make the dose3 parser choke. This leads to my tools being unable to parse the affected files. The following script not only downloads all Packages and Sources files at five-day intervals (4460 MB from 2005/03/12 to 2012/10/11) but also cleans up all the syntax errors that dose3 could not parse. This includes invalid version naming, architecture lists separated by commas, disjunctions in Conflicts fields and incorrect braces/bracket usage. Maybe this helps others who also want to profit from Packages and Sources files from the past. Fun fact #1: starting from June 2010, there were no more syntax errors in the Packages and Sources files of Debian Sid. Fun fact #2: starting from December 2009, there are no more mismatches between versions of binary packages in the Packages file and the versions of the corresponding source packages in the Sources file. The script can be found at https://gist.github.com/3880904.

10 October 2012

Johannes Schauer: Using Gentoo to find reduced build dependencies for Debian source packages

Automatically devising a build order that allows bootstrapping Debian currently fails (amongst other reasons) because of the lack of metadata about which build dependencies can potentially be dropped from source packages. If that information was available, an algorithm could decide which build dependencies to drop so that dependency cycles can be broken. Finding droppable build dependencies of a source package is something only humans can do, because it involves manually analyzing and testing the build system of a source package. Build systems are neither uniform nor do they encode their dependencies in a way that can directly be mapped to Debian packages. Therefore they are not machine readable. One idea to solve the dilemma is to find a Linux distribution that provides the following: If such a distribution can be found then the information from it can be used to find dependencies that can also be dropped from Debian source packages.

Gentoo is a distribution that fulfills the above requirements through so-called USE flags that allow enabling or disabling features during compilation. Dependencies of Gentoo source packages are stored in .ebuild files that control the build process. Since .ebuild files are bash scripts, parsing them is not trivial. I therefore used the emerge software package to extract that information. Thanks to the well written emerge code and to quick help in the Gentoo IRC channel, it didn't take long to make the code run on Debian. My source code can be downloaded here: https://gitorious.org/debian-bootstrap/gen2deb Before I list the results of using Gentoo USE flags to determine dependencies that can potentially be dropped from Debian source packages, let me list the problems that this method entails.

Only package name matching, no version matching When writing the mapping from Debian to Gentoo packages and back I discard version information. There are just too many versions that exist in either Debian or Gentoo but not in the other. So the assumption is that Debian Sid and Gentoo both carry the most recent major versions of upstream software, which have roughly the same requirements in terms of build dependencies.

Gentoo packages are matched to Debian source packages In Gentoo there are only source packages and no binary packages, so I map Gentoo packages to Debian source packages. But Gentoo source packages build depend on other source packages while Debian source packages depend on binary packages. So at some point I have to translate Gentoo packages to Debian source packages and those source packages to Debian binary packages. I do this by taking the original binary build dependencies of a Debian source package and marking as droppable those binary packages which are built by Debian source packages that were found to be droppable, as sketched below.
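In terms of data, the translation step looks roughly like this (a Python sketch with made-up package names and hand-rolled dictionaries standing in for the real Gentoo and Debian metadata):
# Hypothetical inputs: the binary build dependencies of a Debian source package,
# the source package that builds each binary package, and the Debian source
# packages whose Gentoo counterpart was found to be a droppable dependency.
build_deps = {"src:foo": {"libbar-dev", "python", "gettext"}}
binary_to_source = {"libbar-dev": "src:bar", "python": "src:python2.7", "gettext": "src:gettext"}
droppable_sources = {"src:foo": {"src:bar"}}  # Gentoo says foo builds without bar

def droppable_binary_deps(src):
    # Binary build dependencies of src that are built by a source package
    # which the Gentoo USE flag analysis marked as droppable for src.
    return set(b for b in build_deps[src]
               if binary_to_source.get(b) in droppable_sources.get(src, set()))

print(droppable_binary_deps("src:foo"))  # -> a set containing 'libbar-dev'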

Not the exact same package set There is some software that is only in Gentoo and some that is only in Debian. Debian and Gentoo also split some source packages differently.

Gentoo has more direct dependencies Many build dependencies in Debian are indirectly pulled in through dependencies of direct build dependencies. In Gentoo, source packages directly depend on most things they need to build successfully. This leads to the list of dependencies in Gentoo being much larger than the list of dependencies in the corresponding Debian source package. It also means that lots of dependencies that can be dropped in Gentoo are not found to be droppable in Debian because they are not direct dependencies of that source package.

There are no implicit dependencies Gentoo will often drop dependencies that are essential or build-essential packages in Debian and are therefore implicit build dependencies that cannot be dropped.

Result Despite the many problems, the result doesn't look too wrong. I got some Debian source packages that were found to have droppable build dependencies from Thorsten Glaser, and all dependencies that Gentoo found to be droppable were also dropped by him. To put everything into numbers: the current 912 node big SCC in Debian Sid can be reduced to 6 individual SCCs with 422, 5, 5, 3, 2 and 2 nodes each. So using Gentoo cuts the size of the central component by more than half. Surely, there will be a number of dependencies that were found to be droppable in Gentoo but are actually not droppable in Debian. The point is that it is better to have "some" data, even if it contains false positives, than no data at all. It is easier for a human to verify whether some suggested droppable build dependencies are actually correct than going through hundreds of source packages with thousands of dependencies manually.

28 September 2012

Johannes Schauer: visit at Paris IRILL

Last week, I was invited to give a talk about my Debian bootstrapping efforts at IRILL in Paris. The slides of my talk are online as pdf. The time I spent in Paris with Pietro Abate was very fruitful. I have to thank him and Roberto di Cosmo for inviting me and even compensating for my travel expenses. The things we actually managed to implement during my visit: The two most important things (in my opinion) that we came up with for future implementation were the idea to harvest reduced build dependency information from Gentoo as well as finding a flaw in the way the current dependency graph relates to binary packages and their installation sets. I am still busy evaluating output from my trials with Gentoo, so I will cover this topic in a later blog post once I have generated the actual impact on the dependency graph. The current status is that Gentoo USE flags allow me to find possibly droppable build dependencies for 250 out of the 350 interesting Debian source packages that are part of the main SCC.

Pietro also found a flaw in how the dependency graph is currently generated. While source packages A and B might both depend on binary package C (and its installation set), it is wrong to add a dependency to C from both source packages without further verification. Due to virtual packages and disjunctive dependencies, C might have many possible installation sets. Only one of them is chosen in the current code. The problem is that this one chosen set might conflict with the other build dependencies of source packages depending on C. Therefore, there must exist multiple binary package nodes for C, each with a different installation set, dynamically generated as they are needed. Source packages must point to the node for C that possesses an installation set which doesn't conflict with their own build dependencies.

Other TODO notes that we came up with and that I will be implementing are: As I cannot always depend on new dose3 versions being pushed to Debian Sid right after their release, I will do the dose3 git submodule integration over the weekend. This will allow me to evaluate the results from my evaluation of Gentoo USE flags that I gathered over the past week.

8 August 2012

Johannes Schauer: Bootstrappable Debian - How to help

TLDR: multiarch, multiarch, multiarch, cross buildability, staged build dependencies, wiki page, corrections/hints/requests to debian-bootstrap at lists.mister-muffin.de This summer (and this year's GSoC) is nearing its end and to make it easier for people to make use of the information my tools produced so far, I created a page in the Debian wiki. It lists not only the open issues I see but also statistics that I gathered using the output of my GSoC project. I want to use this blog post to make people aware of that page as well as to get some feedback on it and anything related to it.

The biggest blocker my tools face is that many packages are still missing multiarch information. As long as at least the basic packages do not have their cross build dependencies satisfied via multiarch for an existing foreign architecture, automated tools cannot properly analyze the dependency situation in the bootstrapping case, when many packages of the new foreign architecture do not even exist yet. If Debian is supposed to be bootstrappable, then the first stage is to make a set of basic packages cross compile for an existing foreign architecture. Once this is possible, a tool of mine can analyze the cyclic build dependency situation that might occur when cross compiling for an architecture that does not exist yet. Then, staged cross builds can be used to cross compile a minimal foreign system. Due to missing multiarch classification, it is not known yet how big the cyclic build dependency situation is for the base packages. It is not only the conversion of packages to multiarch that is needed but also the adding of the :any (and in rare cases :native) qualifier to build dependencies on M-A: allowed packages. Prominent build dependencies that should be (but are not yet) M-A: allowed are python and gettext. Both are needed as a build dependency by many packages of the base system. Unfortunately wanna-build does not understand qualifiers like :any and :native yet. Until it does, no package can be marked :any or :native and cross compilation of many base packages cannot succeed.

Once the point is reached where a base system can be cross compiled from nothing, native compilation can start. Since native compilation doesn't depend on multiarch, the dependency situation when trying to natively compile all of Debian from nothing is understood much better. Unfortunately, the cyclic build dependency situation is also much worse in the native case and there exists a big 1000-node strongly connected component of binary and source packages that all interdepend on each other. This dependency mess can be solved using three approaches: The wiki page gives many hints on how to find packages that each method can be applied to. Stage building is a tool that might be useful for cross building (we don't know for sure yet) but is definitely needed for native compilation. It is needed for native compilation because, after all possible dependencies are moved to Build-Depends-Indep, the only other alternative to stage building for breaking dependency cycles is to cross build source packages. Since building a package without one of its build dependencies "staged" is often much easier than making the package in question cross compile, it is a preferred alternative. Once more packages have been made multiarch, it might be possible to prove that there is no alternative to introducing a notion of staged builds.
Some people (wookey, Patrick McDermott, Guillem Jover, myself) decided that the following format to mark staged build dependencies would be preferred over others:
Build-Depends: huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny
The <> format was proposed by Guillem Jover in bug#661538. Patches for dpkg and dose3 are done. More people need to discuss this format before a final decision on how to indicate staged build dependencies can be made. For more information on the topic, have a look at the corresponding wiki page. Feel free to direct any comments/critique/hints to debian-bootstrap at lists.mister-muffin.de or directly to me.

30 July 2012

Johannes Schauer: port bootstrap build-ordering tool report 4

A copy of this post is sent to soc-coordination@lists.alioth.debian.org as well as to debian-bootstrap@lists.mister-muffin.de.

Diary

July 2
  • playing around with syntactic dependency graphs and how to use them to flatten dependencies

July 4
  • make work with dose 3.0.2
  • add linux-amd64 to source architectures
  • remove printing in build_compile_rounds
  • catch Not_found exception and print warning
  • use the whole installation set in crosseverything.ml instead of flattened dependencies
  • detect infinite loop and quit in crosseverything.ml
  • use globbing in _tags file
  • use wildcards and patsubst in makefile

July 5
  • throw a warning if there exist binary packages without source packages
  • add string_of_list and string_of_pkglist and adapt print_pkg_list and print_pkg_list_full to use them
  • fix and extend flatten_deps - now also tested with Debian Sid

July 6
  • do not exclude the crosscompiled packages from being compiled in crosseverything.ml
  • clean up basebuildsystem.ml, remove old code, use BootstrapCommon code
  • clean up basenocycles.ml, remove unused code and commented out code
  • add option to print statistics about the generated dependency graph
  • implement most_needed_fast_wrong as well as most_needed_slow_correct and make both available through the menu

July 7
  • allow to investigate all scc, not only the full graph and the scc containing the investigated package
  • handle Not_found in src_list_from_bin_list with warning message
  • handle the event of the whole archive actually being buildable
  • replace raise Failure with failwith
  • handle incorrectly typed package names
  • add first version of reduced_dist.ml to create a self-contained mini distribution out of a big one

July 8
  • add script to quickly check for binary packages without source package
  • make Debian Sid default in makefile
  • add *.d.byte files to .gitignore
  • README is helpful now
  • more pattern matching and recursiveness everywhere

July 9
  • fix termination condition of reduced_dist.ml
  • have precise as default ubuntu distribution
  • do not allow to investigate an already installable package

July 10
  • milestone: show all cycles in a graph
  • add copyright info (LGPL3+)

July 11
  • advice to use dose tools in README

July 16
  • write apt_pkg based python filter script replacing grep-dctrl

July 17
  • use Depsolver.listcheck more often
  • add dist_graph.ml
  • refactor dependency graph code into its own module

July 18
  • improve package selection for reduced_dist.ml
  • improve performance of cycle enumeration code

July 20
  • implement buildprofile support into dose3

July 22
  • let dist_graph.ml use commandline arguments

July 23
  • allow dose3 to generate source package lists without Build-Depends-Indep and Build-Conflicts-Indep

July 29
  • implement crosscompile support into dose3

Results

Readme There is not yet a writeup on how everything works and how all the pieces of the code work together but the current README file provides a short introduction on how to use the tools.
  • build and runtime dependencies
  • compile instructions
  • execution examples for each program
  • step by step guide how to analyze the dependency situation
  • explanation of general commandline options
A detailed writeup about the inner workings of everything will be part of a final documentation stage.

License All my code is now released under the terms of the LGPL, either version 3, or (at your option) any later version. A special linking exception is made to the license, which can be read at the top of the provided COPYING file. The exception is necessary because OCaml links statically, which means that without that exception, the conditions of distribution would basically equal GPL3+.

reduced_dist.ml Especially the Debian archive is huge and one might want to work on a reduced selection of packages first. Having a smaller selection of the archive would be significantly faster and would also not add thousands of packages that are not important for an extended base system. I call a reduced distribution a set of source packages A and a set of binary packages B which fulfill the following three properties:
  • all source packages A must be buildable with only binary packages B being available
  • all binary packages B except for architecture:all packages must be buildable from source packages A
The set of binary packages B and source packages A can be retrieved using the reduced_dist program. It allows either building the most minimal reduced distribution or one that includes a certain package selection. To filter out the package control stanzas for a reduced distribution from a full distribution, I originally used a call to grep-dctrl but later replaced that with a custom python script called filter-packages.py. This script uses python-apt to filter Packages and Sources files for a certain package selection.
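The core idea of such a filter can be sketched with python-apt's TagFile parser (a simplified sketch, not the actual filter-packages.py; file names and the package selection are placeholders):
import apt_pkg

apt_pkg.init()

selection = set(["dpkg", "eglibc", "gcc-4.7"])  # placeholder package selection

with open("Packages") as infile, open("Packages.reduced", "w") as outfile:
    for stanza in apt_pkg.TagFile(infile):
        if stanza["Package"] not in selection:
            continue
        # Write the stanza back out field by field (multi-line fields such as
        # Description lose their exact folding, which is fine for a sketch).
        for field in stanza.keys():
            outfile.write("%s: %s\n" % (field, stanza[field]))
        outfile.write("\n")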

dist_graph.ml It soon became obvious that there were not many independent dependency cycle situations but just one big SCC that contains 96% of the packages involved in build dependency cycles. Therefore it made sense to write a program that does not iteratively build the dependency graph starting from a single package, but which builds a dependency graph for a whole archive.

Cycles I can now enumerate all cycles in the dependency graph. I covered the theoretical part in another blog post and wrote an email about the achievement to the list. Both resources contain more links to the respective source code. The dependency graph generated for Debian Sid has 39486 vertices. It has only one central SCC with 1027 vertices and only eight other SCCs with 2 to 7 vertices. All the other source and binary packages in the dependency graph for the archive are degenerate components of length one. Obtaining the attached result took 4 hours on my machine (Core i5 @ 2.53GHz). 1.5 h of that were needed to build the dependency graph, the other 2.5 hours were needed to run Johnson's algorithm on the result. Memory consumption of the program was at about 700 MB. It is to my joy that apparently the runtime of the cycle finding algorithm for a whole Debian Sid repository as well as the memory requirements are within orders of magnitude that are justifiable when being run on off-the-shelf hardware. It must also be noted that nothing is optimized for performance yet. A list of all cycles in Debian Sid up to length 4 can be retrieved from this email. This cycle analysis assumes that only essential packages, build-essential and its dependencies, and debhelper are available. Debhelper is not an essential or build-essential package but 79% of the archive build-depends on it. The most interesting cycles are probably those of length 2 that need packages that they build themselves. Noticeable examples for these situations are vala, python, mlton, fpc, sbcl and ghc. Languages seem to love needing themselves to be built.
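Finding those length-2 cycles is also the easiest special case: a pair of packages that mutually depend on each other shows up as an edge in both directions. A tiny Python sketch on a toy edge list (the package names are made up):
def two_cycles(edges):
    # Return pairs (u, v) such that both u -> v and v -> u are edges.
    edge_set = set(edges)
    return sorted(set(tuple(sorted((u, v))) for (u, v) in edge_set if (v, u) in edge_set))

# Toy example in the spirit of a self-hosting language:
print(two_cycles([("src:ghc", "bin:ghc"), ("bin:ghc", "src:ghc"), ("src:gcc", "bin:cpp")]))
# -> one pair: ('bin:ghc', 'src:ghc')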

Buildprofiles There is a long discussion about how to encode staged build dependency information in source packages. While the initial idea was to use Build-Depends-StageN fields, this solution would duplicate large parts of the Build-Depends field, which leads to bitrot, and it is also inflexible with regard to possible other build "profiles". To remedy the situation it was proposed to use field names like Build-Depends[stage1 embedded], but this would also duplicate information and would break the rfc822 format of package description files. A document maintained by Guillem Jover gives even more ideas and details. Internally, Patrick and I settled on another idea of Guillem Jover's to annotate staged build dependencies. The format reads like:
Build-Depends: huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny
So each build profile would follow a dependency in <> "brackets" and have a format similar to architecture options. Patrick has a patch for dpkg that implements this functionality while I patched dose3.

Dropping Build-Depends-Indep and Build-Conflicts-Indep When representing the dependencies of a source package, dose3 concatenates its Build-Depends and Build-Depends-Indep dependency information. So up to now, a source package could only be compiled if it managed to compile all of its binary packages, including architecture:all packages. But when bootstrapping a new architecture, it should be sufficient to only build the architecture dependent packages and therefore to only build the build-arch target in debian/rules and not the build-indep target. Only considering the Build-Depends field and dismissing the Build-Depends-Indep field reduced the main SCC from 1027 vertices to 979 vertices. The number of cycles up to length four dropped from 276 to 206. Especially the cycles containing gtk-doc-tools, doxygen, debiandoc-sgml and texlive-latex-base became much fewer. Patrick managed to add a Build-Depends-Indep field to four packages so far, which reduced the SCC further by 14 vertices down to 965 vertices. So besides staged build dependencies and cross building there is now a third method that can be applied to break dependency cycles: add Build-Depends-Indep information to packages or update existing information. I submitted a list of packages that have a binary-indep and/or a build-indep target in their debian/rules to the list. I also submitted a patch for dose3 to be able to specify that Build-Depends-Indep and Build-Conflicts-Indep information should be ignored.

Dose3 crossbuilding So far I only looked at dependency situations in the native case. While the native case contains a huge SCC of about 1000 packages, the dependency situation will be much nicer when cross building. But dose3 was so far not able to simulate cross building of source packages. I wrote a patch that implements this functionality and will allow me to write programs that help analyze the cross situation as well.

Debconf Presentation Wookey gave a talk at debconf12 for which I supplied him with slides. The slides in their final version can be downloaded here.

Future Patrick maintains a list of "weak" build dependencies. Those are dependencies that are very likely to be droppable in either a staged build or by using Build-Depends-Indep. I must make use of this list to make it easier to find packages whose dependencies can easily be dropped. I will have to implement support for resolving the main SCC using staged build dependencies. Since it is unlikely that Patrick will be fast enough in supplying me with modified packages, I will need to create a database of dummy packages myself. Another open task is to allow analyzing the cross-building dependency situation. What I'm currently more or less waiting on is the inclusion of my patches into dose3 as well as a decision on the build profile format. More people need to discuss it before it can be included in tools as well as in policy. Every maintainer of a package can help make bootstrapping easier by making sure that as many dependencies as possible are part of the Build-Depends-Indep field.

4 July 2012

Johannes Schauer: enumerating elementary circuits of a directed graph

For my GSoC project this year I need to be able to enumerate all elementary circuits of a directed graph. My code is written in Ocaml but neither the ocamlgraph library nor graph libraries for other languages seem to implement a well tested algorithm for this task. Lacking such a well tested solution to the problem, I decided to implement a couple of different algorithms. Since it is unlikely that different algorithms yield the same wrong result, I can be reasonably certain that each individual algorithm works correctly if they all agree on a single solution. As a result I wrote a testsuite, containing an unholy mixture of Python, Ocaml, D and Java code, which implements algorithms by D. B. Johnson, R. Tarjan, K. A. Hawick and H. A. James.
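For small graphs, a brute force enumeration is easy to get right and serves as yet another cross-check. Here is a minimal Python sketch (it is not part of the testsuite) that reports each elementary circuit exactly once by rooting it at its smallest vertex:

def elementary_circuits(n, edges):
    # enumerate the elementary circuits of a directed graph with vertices
    # 0..n-1; exponential runtime, but fine as a cross-check on small graphs
    successors = dict((v, []) for v in range(n))
    for a, b in edges:
        successors[a].append(b)
    circuits = []
    def visit(start, vertex, path, onpath):
        for succ in successors[vertex]:
            if succ == start:
                circuits.append(list(path))
            elif succ > start and succ not in onpath:
                onpath.add(succ)
                path.append(succ)
                visit(start, succ, path, onpath)
                path.pop()
                onpath.remove(succ)
    for start in range(n):
        visit(start, start, [start], set([start]))
    return circuits

for circuit in elementary_circuits(3, [(0, 1), (0, 2), (1, 0), (2, 0), (2, 1)]):
    print(" ".join(str(v) for v in circuit))

For the three-vertex example graph used further below this prints the circuits 0 1, 0 2 and 0 2 1, which is exactly the output the four implementations are expected to agree on.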

Algorithm by R. Tarjan The earliest algorithm that I included was published by R. Tarjan in 1973.
Enumeration of the elementary circuits of a directed graph
R. Tarjan, SIAM Journal on Computing, 2 (1973), pp. 211-216
http://dx.doi.org/10.1137/0202017
I implemented the pseudocode given in the paper using Python. The git repository can be found here: https://github.com/josch/cycles_tarjan

Algorithm by D. B. Johnson The algorithm by D. B. Johnson from 1975 improves on Tarjan's algorithm in terms of complexity.
Finding all the elementary circuits of a directed graph.
D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975.
http://dx.doi.org/10.1137/0204007
In the worst case, Tarjan's algorithm has a time complexity of O(n e(c+1)) whereas Johnson's algorithm supposedly manages to stay in O((n+e)(c+1)) where n is the number of vertices, e is the number of edges and c is the number of cycles in the graph. I found two implementations of Johnson's algorithm. One was done by Frank Meyer and can be downloaded as a zip archive. The other was done by Pietro Abate and the code can be found in a blog entry which also points to a git repository. The implementation by Frank Meyer seemed to work flawlessly. I only had to add code so that a graph could be given via commandline. The git repository of my additions can be found here: https://github.com/josch/cycles_johnson_meyer Pietro Abate implemented an iterative and a functional version of Johnson's algorithm. It turned out that both yielded incorrect results as some cycles were missing from the output. A fixed version can be found in this git repository: https://github.com/josch/cycles_johnson_abate

Algorithm by K. A. Hawick and H. A. James The algorithm by K. A. Hawick and H. A. James from 2008 improves further on Johnson's algorithm and does away with its limitations.
Enumerating Circuits and Loops in Graphs with Self-Arcs and Multiple-Arcs.
K. A. Hawick and H. A. James, In Proceedings of FCS 2008, 14-20
www.massey.ac.nz/~kahawick/cstn/013/cstn-013.pdf
In contrast to Johnson's algorithm, the algorithm by K. A. Hawick and H. A. James is able to handle graphs containing edges that start and end at the same vertex as well as multiple edges connecting the same two vertices. I do not need this functionality but added the code for additional verification. The paper includes extensive code snippets written in the D programming language. A full, working version with all pieces connected together can be found here: https://github.com/josch/cycles_hawick_james The algorithm was verified using example output given in the paper. The project README states how to reproduce it.

Input format All four codebases have been modified to produce executables that take the same commandline arguments, which describe the graph to investigate. The first argument is the number of vertices of the graph. Subsequent arguments are ordered pairs of comma separated vertices that make up the directed edges of the graph. Let's look at the following graph as an example (the "cycle example" figure). The DOT source for this graph is:
digraph G {
  0;
  1;
  2;
  0 -> 1;
  0 -> 2;
  1 -> 0;
  2 -> 0;
  2 -> 1;
}
To generate the list of elementary circuits using Tarjan's algorithm for the graph above, use:
$ python cycles.py 3 0,1 0,2 1,0 2,0 2,1
0 1
0 2
0 2 1
The commandline arguments are exactly the same for the other three methods and yield the same result in the same order. If the DOT graph is in a format as simple as the one above, the following sed construct can be used to generate the commandline argument that represents the graph:
$ echo $(sed -n -e '/^\s*[0-9]\+;$/p' graph.dot | wc -l) $(sed -n -e 's/^\s*\([0-9]\) -> \([0-9]\);$/\1,\2/p' graph.dot)
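The same conversion can also be done with a few lines of Python; this is merely an equivalent sketch of the sed pipeline above and assumes the same very simple DOT layout:

import re
import sys

def dot_to_arguments(path):
    # turn a simple DOT file (one vertex or edge per line) into the
    # commandline arguments expected by the executables: the number of
    # vertices followed by the comma separated edges
    vertices = 0
    edges = []
    for line in open(path):
        line = line.strip()
        if re.match(r'^[0-9]+;$', line):
            vertices += 1
        match = re.match(r'^([0-9]+) -> ([0-9]+);$', line)
        if match:
            edges.append("%s,%s" % (match.group(1), match.group(2)))
    return [str(vertices)] + edges

if __name__ == '__main__':
    print(" ".join(dot_to_arguments(sys.argv[1])))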

Testsuite As all four codebases take the same input format and have the same output format, it is now trivial to write a testsuite that compares the individual output of each algorithm for the same input and checks for differences. The code of the testsuite is available via this git repository: https://github.com/josch/cycle_test The other four repositories exist as submodules of the testsuite repository.
$ git clone git://github.com/josch/cycle_test.git
$ cd cycle_test
$ git submodule update --init
A testrun is done via calling:
$ ./test.sh 11
The argument to the shell script is an integer denoting the maximum number N of vertices for which graphs will be generated. The script will compile the Ocaml, Java and D sourcecode of the submodules as well as an ocaml script called rand_graph.ml which generates random graphs with v = 1..N vertices, where N is given as a commandline argument. For each number of vertices v, e = 1..M numbers of edges are chosen, where M is the maximum number of edges possible given the number of vertices. For every combination of number of vertices v and number of edges e, a graph is randomly generated using Pack.Digraph.Rand.graph from the ocamlgraph library. Each of those generated graphs is checked for cycles and written to a file graph-v-e.dot if it contains one. test.sh then loops over all generated dot files. It uses the above sed expression to convert each dot file to a commandline argument for each of the algorithms. The outputs of each algorithm are then compared with each other and only if they do not differ does the loop continue. Otherwise the script returns with an exit code. The tested algorithms are the Python implementation of Tarjan's algorithm, the iterative and functional Ocaml implementations as well as the Java implementation of Johnson's algorithm and the D implementation of the algorithm by Hawick and James.

Future developments Running the testsuite with a maximum of 12 vertices takes about 33 minutes on a 2.53GHz Core2Duo and produces graphs with as many as 1.1 million cycles. It seems that all five implementations agree on the output for all 504 generated graphs that were used as input. If there is another implementation of an algorithm that enumerates all elementary circuits of a directed graph, I would like to add it. There are some more papers that I would like to read but I lack access to epubs.siam.org and ieeexplore.ieee.org and would have to buy them. Benchmarks seem a bit pointless as not only are the algorithms very different from each other (and there are many ways to tweak each of them) but the programming languages differ as well. For the curious kind, here is the time each algorithm takes to enumerate all cycles for all generated graphs up to 11 vertices.
algorithm                           time (s)
Johnson, Abate, Ocaml, iterative    137
Johnson, Abate, Ocaml, functional   139
Tarjan, Python                      153
Hawick, D                           175
Johnson, Meyer, Java                357
The iterative Ocaml code performs as well as the functional one. In practice, the iterative code should probably be preferred as the functional code is not tail recursive. On the other hand it is also unlikely that cycles ever grow big enough to make a difference in the used stack space. The Python implementation executes surprisingly fast, given that Tarjan's algorithm is supposedly inferior to Johnson's and given that Python is interpreted, but it is also the simplest implementation with the fewest required data structures. The D code potentially suffers from the bigger data structures and other bookkeeping that is required to support multi- and self-arcs. The Java code implements a whole graph library, which might explain some of its slowness.

2 July 2012

Johannes Schauer: port bootstrap build-ordering tool report 3

A copy of this post is sent to soc-coordination@lists.alioth.debian.org as well as to debian-bootstrap@lists.mister-muffin.de.

Diary

June 18 Pietro suggests a faster way to generate installation sets for a list of packages. In my case, I need an installation set for every source package in the archive to find out how often a binary package is needed to build a source package. As a result, the speed is doubled compared to the original approach.

June 19
  • adapt code to work with new dose release 3.0
  • remove unneeded parts of code
  • add different possibilities to find amount of source packages that need a binary package
  • add code to get multiple installation sets using Depsolver_int.solve

June 20
  • add ~global_constraints:false to Depsolver.listcheck, Depsolver.trim and Depsolver.edos_install calls
  • adapt output graph to limited xdot capabilities

June 21 I formulate an email to the list, reporting on the dependency graphs of debhelper, cdbs, pkg-config and libgtk2.0-dev. My current technique gets an installation set for a source package, removes all those packages that are already installable and adds the others as a dependency of that source package. This dependency will include an installation set of that binary as well, minus all packages that are already available. The problem with that approach are dependency cycles created by long dependency chains. Example: src:A needs B needs C needs A. B and C would both be added as a dependency of src:A. B as well as C would also include their installation set, which in both cases includes A. So now there are two cycles: src:A->B->A and src:A->C->A. For a real life example, look at the situation of cdbs and src:sqlite3 (the "cdbs old situation" graph). It is created because src:sqlite3 needs cdbs needs python-scour needs python needs python2.7 needs libsqlite3-0. Therefore libsqlite3-0 is in the installation set of cdbs, python-scour, python and python2.7. This creates five cycles in the graph even though there is only one. It would be better to reduce the dependencies added to src:sqlite3 to its direct dependency, which is cdbs. Package dependencies are disjunctions from which the solver chooses one alternative or the other to build an installation set. To solve the problem above I would need to know which alternative of a disjunction the solver chose and then only add the direct dependency of a package to the dependency graph.
  • improve build_compile_rounds performance
  • big overhaul of menu structure
  • fix subgraph extraction code

June 22
  • do not create a universe if not needed - use hashtables instead
  • for sorting packages, generating difference of package sets and otherwise comparing packages, always use CudfAdd.compare
  • as a custom list membership function, use List.exists instead of trying List.find
  • more speedup for build_compile_rounds
  • the number of source packages that can be built does NOT include the cross built packages
  • print closure members in graph
  • refactor code and move common functions to bootstrapCommon.ml
  • add breakcycles.ml for future code to break cycles using staged build dependencies
  • use more extlib functionality
  • extended package list input format

June 23 After several emails with Pietro I learn about syntactic dependency graphs. To document my progress and prove my efforts I committed the code as commit 6684c13. But this code soon turned out to be unnecessary, so it will be removed later and just serves as documentation.

June 24 I came up with another (better?) solution to get the chosen disjunctions. It simply uses the calculated installation set to decide for each disjunction which alternative was taken by the solver. I reported that important step and the open questions involved with it in an email to the list. The problem always was that an installation set can easily contain more than one package of a disjunction. In this case it is not clear which branch was chosen by the solver. I found that in Ubuntu Natty there are only 6 such packages and for each of them the situation can be resolved: in all of those cases either one package of the disjunction provides the other or both packages depend upon each other, which means that both have to be included.
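As a minimal sketch of that idea (the real code operates on CUDF structures inside dose3; the package names below are only illustrative):

def chosen_alternatives(depends, installation_set):
    # depends: list of disjunctions, each a list of alternative package names
    # installation_set: set of package names the solver decided to install
    # returns, for each disjunction, the alternatives found in the
    # installation set (usually exactly one, see the 6 exceptions above)
    return [[alt for alt in disjunction if alt in installation_set]
            for disjunction in depends]

depends = [["python-scour", "scour"], ["python"]]
installation_set = set(["python-scour", "python", "python2.7", "libsqlite3-0"])
print(chosen_alternatives(depends, installation_set))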

June 27
  • use installation set to flatten build dependencies of source packages
  • refactor code and move common functions to bootstrapCommon.ml

June 25 I need an algorithm that finds all circuits in a given graph. This is necessary so that:
  1. cycles can be enumerated for huge dependency graphs where cycles are hard to see
  2. cycles can be enumerated to find a cycle that can be broken using staged build dependencies
It seems that Johnson's algorithm is the best way to do this complexity-wise and Pietro already blogged about the problem together with an implementation of the algorithm in ocaml. Unfortunately it turns out that his code doesn't implement the algorithm correctly and hence misses some cycles. The fix seems not to be trivial so I'm still investigating it.

June 28
  • add crosseverything.ml to obtain a list of source packages that, if cross compiled, would make the whole archive available

Results While the first week was productive as usual, I had to spend some time on a university project during the second week and also attend a family meeting. I will catch up with the lost time over the course of the next week.

dose3 Using dose 3.0 (which fixes a bug regarding essential packages) the output of the algorithms is now likely less wrong than before.

performance Performance was improved in the generation of installation sets as well as in the code that tries out how many packages can be built in multiple rounds. This was achieved by more caching, fewer unnecessary operations in looping constructs, replacing lists with hashtables and not creating universes where they are not necessary.

user interface The main program, basenocycles.ml, now has a much better menu structure.

input format The programs take two package file inputs: the list of source packages that have to be cross built for a minimal build system and the list of source packages that were chosen to be cross compiled in addition to that. Both files list one source package per line and now allow comments.

refactoring As more and more scripts are added, more and more functionality is moved to bootstrapCommon.ml which makes each script much cleaner.

what to test for cross building As discussed in the "Future" section of the last report, I have now automated the process of finding out which packages, if they were cross compiled, would make the whole archive available because they break all cycles and allow native compilation of the rest. The outcome: to build 3333 out of 3339 packages in Natty natively, at most 186 source packages must be cross compiled. The other six cannot be compiled because of version mismatches between the Natty Sources.bz2 and Packages.bz2. The code can be run from crosseverything.ml.

limit source dependencies to direct dependencies Reducing the dependencies of source packages from their full installation set to their direct dependencies, by finding out which alternative of each disjunction in their dependency list was taken, greatly simplifies the dependency graphs. The dependency graph created for libgtk2.0-dev could be reduced from 491 to 247 vertices for a depth of three. For cdbs it is now clearly visible that cdbs depends on libsqlite3-0 which builds from src:sqlite3 which depends on cdbs (compare the "cdbs old situation" and "cdbs new situation" graphs). For pkg-config the graph has also been reduced to the one single cycle that matters: src:pkg-config needs libglib2.0-dev which depends on pkg-config which builds from src:pkg-config (again, compare the before and after pkg-config graphs).

Future I will prepare content for Wookey's debconf talk on crossbuilding and bootstrapping. As this will include directions on how to use the current code, I will kill two birds with one stone and write some proper documentation for my current source. The following two lists will be displayed after a dependency graph is calculated and reduced to its scc (a rough sketch of how they could be computed follows the list):
  • those source packages that have the least build dependencies not fulfilled. Those might be candidates for easy staged build dependencies. Since the source package is part of the scc, it will definitely be involved in some cycle somewhere.
  • those binary packages that most source packages depend upon. Those could be candidates for cross compilation as it might be easier to cross compile the source package than using staged build dependencies.
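The real tool will work on the dose3/CUDF data structures; purely as a sketch over plain Python sets, computing the two lists could look like this:

def rank_candidates(scc_sources, build_deps, available):
    # scc_sources: source packages in the strongly connected component
    # build_deps: source package -> set of binary packages it build-depends on
    # available: set of binary packages that are already installable
    missing = dict((src, build_deps[src] - available) for src in scc_sources)
    # source packages with the fewest unfulfilled build dependencies first
    staged_candidates = sorted(scc_sources, key=lambda src: len(missing[src]))
    # binary packages that the most source packages are still waiting for
    counts = {}
    for deps in missing.values():
        for binary in deps:
            counts[binary] = counts.get(binary, 0) + 1
    cross_candidates = sorted(counts, key=counts.get, reverse=True)
    return staged_candidates, cross_candidates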
Patrick managed to cross build packages with sbuild now. So the list of packages that crosseverything.ml produces can now be checked efficiently for cross buildability. With this list, potentially more cycles can be broken out of the box. A feature will be added that allows the user to remove from a dependency graph all packages that can be cross compiled without any additional effort. Version mismatches between source and binary packages in Sources.bz2 and Packages.bz2 respectively in Ubuntu make the scripts fail and/or produce wrong results. Debian (even Sid) doesn't have this problem, so I should find out where to report this problem to Ubuntu. I need to write a working version of Johnson's algorithm because much functionality depends upon it. I have the option to improve Pietro's version or to write one from scratch. Writing one from scratch might be easier as I have Pietro's code as a template as well as a Java implementation of Johnson's algorithm which seems to work. The following functionalities need working cycle enumeration:
  • given source packages with staged build dependencies, an enumeration of cycles is needed to find out which cycles can be broken by building packages staged. It makes less sense to blindly build a package staged and then check whether this makes more packages available.
  • display cycles of a dependency graph to the user. After obtaining all cycles in the graph it makes sense to sort them by their length. The user would then investigate the situation of the smallest cycles first. This makes sense because breaking small cycles can potentially break bigger cycles. Since in the end, all cycles have to be eliminated anyway, it makes sense for the user to first tackle the small ones.
  • display the feedback arc set to the user. The packages in the feedback arc set might be very good candidates for reduced build dependencies or cross compilation.

17 June 2012

Johannes Schauer: port bootstrap build-ordering tool report 2

A copy of this post is sent to soc-coordination@lists.alioth.debian.org as well as to debian-bootstrap@lists.mister-muffin.de.

Diary

June 4 I added the first version of basenocycles.ml to git. Given an initial set of cross built packages, it tries to compile as much as possible on the resulting system in multiple rounds.

June 5 Back on June 3, I discovered an error in my program that would only come up when using the Debian Sid package lists as the input:
Fatal error: exception Assert_failure("common/edosSolver.ml", 610, 11)
On this day, June 5, I wrote a minimal test case for this problem. The same day, Pietro figured out that this is a bug in dose which will be fixed in the next release. I begin writing code to figure out how important a binary package is for the further build process, try to use Depsolver.edos_install to find out what packages are needed to make debhelper available, restructure basenocycles.ml, exclude source packages that have already been built, still have trouble with already existing binary packages and Cudf.mem_installed, and comment things better.

June 6 I wrote some crude code (only an estimate, not correct, fixed later) that gives a rough overview of how often a given binary package is directly required as a build dependency. Debhelper came out as the most needed package. It is architecture:all, so it does not have to be built, but it has unfulfilled runtime dependencies. To make those packages available, 13 (actually 11, fixed later) packages have to be compiled on Ubuntu Natty. But those packages all (except gettext) require debhelper itself to be built. The first dependency cycle. This dependency cycle (actually, the 12 cycles) can be broken by either cross compiling those source packages or by making them build without debhelper. One goal of the program is to help decide which is the easier option, but this is not yet implemented. To play around a bit, I created the possibility to specify a list of packages that are cross compiled in addition to the minimal set of cross compiled packages. I added the 13 packages found above to the list, thus making the binary packages they build available. This made debhelper available in the system. As a result, 1625 out of 3339 source packages can be built with just a minimal build system (priority:essential packages plus build-essential) and debhelper available. The next package that blocks the most other source packages from being built is cdbs. The next nine packages in that list also require cdbs, so it seems to be the next important package to make available. Pietro's suggestions make me:
 - do not open BootstrapCommon but ExtLib, Common, Algo, Debian
 - do proper option parsing and logging
 - use Debcudf.ignore_essential = true
 - do Debcudf.init_tables (binlist@srclist)
 - use @ with shorter list first
 - use more List.rev_append instead of @
 - use CudfAdd.who_provides to find if a package is available
June 7 Pietro and Patrick suggest that for solving the debhelper cycles, one could provide a minimal debhelper version so that the above list of 12 packages can be built without debhelper. I try to figure out how to get a list of packages that are missing to make a package installable/buildable. This functionality should be provided in dose but I fail to find it.

June 8 Lacking a solution to the problem of June 7, I write a mail to Pietro. I create my first graphs in ocaml using the ocamlgraph library. The graph I generate starts off at a binary package. For each binary package it connects new vertices as its runtime dependencies. If a binary package is not arch:all and also not yet otherwise compiled, its source package is also added. The result is a graph in which the contained set of source packages will make the initial package available, if those source packages are cross compiled. The graph is extended further than the source packages.

June 9 I refine a couple of functions and make univ_get_pkg_by_name return the package with the highest version number. I wrote a rather lengthy (1027 words) email to the list that explains my status as of this day. I can create graphviz dot files with ocaml, can create node and edge types and create the graph using an imperative pattern that I saw a lot in Pietro's code. Disjunctions are not yet handled correctly (see mail from June 8). The graphs generated look like the following: http://mister-muffin.de/p/8nyc.png

June 11 I write a test case which shows how CudfAdd.who_provides doesn't find virtual packages. I automate the process of finding the packages that, if cross compiled, would make another package available, add more predicates (identifying source packages) and improve the input file reading code. I move build_compile_rounds, which compiles as many source packages as it can in multiple rounds on a given minimal system, to a toplevel function and thereby abstract it more. I create a rudimentary text based menu to choose different actions to take for an analysis, start writing an extended version of simple_dependency_graph for deeper analysis, use xdot to show graphs from the text menu and allow saving those graphs to a file.

June 12 I move functionality from the extended version of simple_dependency_graph over to the normal version and delete the extended version. I add the new Closure vertex type and create extended_dependency_graph, which is supposed to not contain single binary package vertices but handle a package and its installation set as one vertex. The output of extended_dependency_graph is optionally reduced to the biggest (non degenerate) strongly connected component. The user gets the option of choosing the exploration depth.

June 13 Pietro replies to my mail from June 8 but apparently I failed to express myself well enough in my last mail, so I rephrase my question. Pietro replies to my email from June 11 and explains how the effect I see is due to "a nuisance of the debian to cudf encoding". As a result I change my code accordingly.

June 14 Another lengthy (1130 words) email to the list. I explain what was done in the past days, what parts work and how they work. I list some rationales for why I did things the way I did them. The most important observation is that after improving my code again and again, I ended up representing the dependency cycle problem in the same (very similar) way that Pietro suggested in the beginning. This is probably a good discovery.
Lots of the data in that email is of only little use now, because as of June 16 I made lots of improvements in correctness. As I don't have an answer to my other email to Pietro from June 13, I implement a very crude way to get an answer to the question of what packages are missing for a package to be available/compileable. I called it flatten_vpkgformula_best_effort and it suffers from many faults including disjunctions and package conflicts. Patrick spots a problem. As a result, I make sure that at no point the source package of an arch:all package can be listed.

June 15 As a reply to my mail from June 13, Pietro creates a new branch in the git and adds the code I needed to get a proper installation set.

June 16 As a result of Pietro's efforts from June 15, I make great advancements on all fronts. Details of the current status follow in the next section.

Results A big leap was made on June 16 due to Pietro's great help in making me understand how Depsolver.listcheck can be used for my purposes. My difficulties in finding the solution myself are rooted in many parts of the dose framework being poorly commented, but Pietro already did a couple of documentation commits whenever things were unclear for me. Using Depsolver.listcheck makes it possible to be more distribution agnostic and I don't have to handle vpkgs, virtual packages and constraints myself anymore. The code also doesn't suffer anymore from wrongly analyzed dependencies and conflicts. The only thing that is not yet taken care of is that Depsolver.listcheck only chooses one out of several possible installation sets. A final version should be able to take into account that a different installation set could provide a better solution. Overall, in comparison to two weeks ago, I can now properly build, traverse and analyze graphs, can choose an installation set properly, and understand more about dependencies, closures, dose and ocaml in general.

Finding the importance of binary packages for building When calculating how many source packages depend on the availability of a binary package I originally flattened the pkg.Cudf.depends list twice for a rough overview. This is of course wrong due to disjunctions and conflicts and also doesn't provide a deep dependency inspection. The new method is to calculate, for every source package, the installation set that is necessary to compile it. The resulting list of binary packages is then used to find out how often a binary package appears in an installation set (a small sketch of this counting step follows below). I see three drawbacks though:

Removing the simple graph The simple graph which contained single binary and source packages was removed. I realized it doesn't really serve any purpose to look at it. As a result, Bin vertices and InstallDep edges are also not part of the graph anymore. Since it was responsible for generating the list of source packages that have to be cross built to make a package available, I created a new function get_cross_source_packages which uses an installation set to serve the same purpose.

Fix extended_dependency_graph extended_dependency_graph now uses installation sets for generating the list of packages that is needed to compile a source package or install a binary package. The list of build dependencies does not include packages that are already installable. The list of runtime dependencies does not include packages that are otherwise available (cross built, arch:all...).
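Here is that small sketch of the counting step (the real code works on CUDF packages inside dose3; the package names are only illustrative):

def binary_importance(installation_sets):
    # installation_sets: source package -> set of binary packages in the
    # installation set of its build dependencies; returns (binary, count)
    # pairs sorted with the most needed binary package first
    counts = {}
    for binaries in installation_sets.values():
        for binary in binaries:
            counts[binary] = counts.get(binary, 0) + 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

sets = {"src:foo": set(["debhelper", "gettext"]),
        "src:bar": set(["debhelper"])}
print(binary_importance(sets))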
Instead of checking for list membership all the time, I created hash tables for the list of installable as well as for the list of available binary packages.

Future There are two big tasks for the next two weeks. Task one is to find a way to give hints on which packages to best consider for having reduced build dependencies. This would then probably finally make use of Pietro's cycle algorithms. Task two is to find a way to break cycles and create a build-DAG from a list of packages that already have staged build dependency information. Patrick is currently working on patching dpkg with Build-Depends-StageN dependencies as well as making perl cross compilable. If he doesn't need the ability to decide which packages to modify to have staged build dependencies in the near future, then task one is probably less urgent and therefore of less importance right now. On the other hand, I can easily generate fake reduced build dependencies so that doing task two right now would not be a problem. Also, having the solution for task two will make it possible to show the user what effect adding reduced build dependencies to a package would have. For the reasons above (it's not urgent, and task one profits from task two being solved) I will go and implement task two first (if there are no objections from my mentors). Another idea that I discussed with Wookey and Patrick yesterday was that, due to multiarch being used for more and more packages, there should exist a set of packages that is cross compilable without any change to the package. We agreed that I make a list of packages that, if cross compiled, would break dependency cycles and make other packages available. I created such a list of about 160 packages for Ubuntu Natty that, if cross compiled, made it possible to have 87% of Natty available (those numbers have to be treated with caution as I did not yet use the proper installation set approach when generating that list, but the order of magnitude should be correct). Wookey can then try to cross compile those packages. If some of those "crucial" source packages are found to be cross compilable, then they should be cross compiled, because it means that no work has to be done to break some cycles. Cross compiling all packages that are cross compilable out of the box is no solution, as only natively compiled packages can go into the archive. This is why the list of potentially additionally cross compiled source packages has to be kept as small as possible.

3 June 2012

Johannes Schauer: port bootstrap build-ordering tool report 1

A copy of this post is sent to soc-coordination@lists.alioth.debian.org as well as to debian-bootstrap@lists.mister-muffin.de. Diary May 21 May 22 May 23 May 24 May 25 May 29 May 30 May 31 June 1 June 2 Results I learned a good chunk of ocaml and how to use dose3 and libcudf. I created a gitorious project and a git repository for all the sourcecode.
git clone git://gitorious.org/debian-bootstrap/bootstrap.git
The git repository as of now contains 30 commits and 1197 lines of ocaml code. So far, 62 emails have been exchanged between me, Pietro and Wookey. I created a mailinglist for this project where all email exchange so far is publicly accessible in the archives. You can also download all of the email exchange in mbox format. Everybody is welcome to join and/or read the list. What seems to be finished: the program that finds the minimal number of source packages that have to be cross compiled to end up with a minimal build system. What it does is the following (a rough sketch of these steps follows the list):
  1. get all essential packages
  2. get their runtime dependencies
  3. get build-essential plus runtime dependencies
  4. get all source packages that are necessary to build 1.-3.; those are the packages that have to be cross compiled
  5. get a list of all packages that are built by source packages from 4.
  6. add all packages from 1.,2.,3. and 5. plus all arch:all packages to a universe
  7. use Depsolver.trim on that universe to figure out which of those packages are actually installable
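Purely as an outline of the steps above (the real implementation uses dose3 and CUDF; every helper and field name in this Python sketch is an assumption for illustration, not the actual code):

def minimal_cross_set(packages, sources, runtime_closure, build_essential):
    # packages: binary package name -> dict with "Priority", "Architecture"
    #           and "Source" fields
    # sources: source package name -> set of binary packages it builds
    # runtime_closure: assumed helper returning the runtime dependency
    #                  closure of a set of binary packages
    # build_essential: set of build-essential binary packages
    essential = set(p for p, f in packages.items()
                    if f.get("Priority") == "required")              # step 1
    needed = runtime_closure(essential)                              # step 2
    needed |= build_essential | runtime_closure(build_essential)     # step 3
    cross = set(packages[p]["Source"] for p in essential | needed)   # step 4
    built = set()
    for src in cross:                                                # step 5
        built |= sources[src]
    arch_all = set(p for p, f in packages.items()
                   if f.get("Architecture") == "all")
    universe = essential | needed | built | arch_all                 # step 6
    # step 7: a real dependency solver (Depsolver.trim in dose3) would now
    # trim this universe down to the actually installable packages
    return cross, universe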
The result of 7. will then contain a list of packages that are available automatically on the foreign system due to cross compiled source packages and arch:all packages. For Debian Sid, the output of my program is:
# (1) number of packages with priority:required: 62
# (2) plus, number of dependencies of priority:required packages: 20
# (3) plus, build-essential and dependencies: 31
# number of source packages to build the above: 71
# number of additional packages built from the above source packages: 292
# (4) number of packages of those plus arch:all packages that are installable: 6421
# total number of installable packages (1)+(2)+(3)+(4): 6534
For Ubuntu Natty it is:
# (1) number of packages with priority:required: 96
# (2) plus, number of dependencies of priority:required packages: 7
# (3) plus, build-essential and dependencies: 31
# number of source packages to build the above: 87
# number of additional packages built from the above source packages: 217
# (4) number of packages of those plus arch:all packages that are installable: 2102
# total number of installable packages (1)+(2)+(3)+(4): 2236
So for Debian, 71 source packages definitely have to be made cross compilable while for Natty the number is 87. The last two days I was toying around with these minimal systems to see how big the number of source packages is that can be built on top of them without running into dependency cycles. After installing the binary packages that were built, I checked again until no new packages could be built. For Natty, I was only able to find 28 additional packages that can be built on top of the 2236 existing ones. This means that a number of dependency cycles prevent building anything else. In the coming two weeks I will focus on coming up with a tool that cleverly helps the user to identify packages that would be useful to have for building more packages (probably determined by how many packages depend on them - debhelper is an obvious candidate). The tool would then show why that crucial package is not available (in the case of debhelper because some of its runtime dependencies are not available and require debhelper to be built) and how the situation can best be resolved. The possible methods to do so are to identify a package that is part of a cycle and either cross compile it or let it have staged build dependencies.

Johannes Schauer: cross-compilable and bootstrappable Debian

When packaging software for Debian, there exist two important assumptions:
  1. Compilation is done natively
  2. Potentially all of Debian is available at compile time
Both assumptions make the life of a package maintainer much easier and they do not create any problem unless you are one of the unlucky few who want to run Debian on an architecture that it does not yet exist for. You will then have to use other distributions like OpenEmbedded or Gentoo, which you compiled (or retrieved otherwise) for that new architecture, to hack on a core of Debian source packages until they build a minimal Debian system that you can chroot into and continue natively building the rest of it. But even if you manage to get that far you will continue to be plagued by cyclic build and runtime dependencies. So you start to hack source packages so that they drop some dependencies and you can break enough cycles to advance step by step. The Debian ports page lists 24 ports of Debian, so despite its unpleasant nature, porting is not a rare undertaking. The process as laid out above has a number of drawbacks: If Debian provided a set of core packages that are cross-compilable and which suffice for a minimal foreign build system, and if it also had enough source packages that provide a reduced build dependency set so that all dependency cycles can be broken, building Debian for a yet unknown architecture could be mostly automated. The benefits would be: With three of this year's GSoC projects, this dream seems to come into reach. There is the "Multiarch Cross-Toolchains" project by Thibaut Girka, mentored by Hector Oron and Marcin Juszkiewicz. Cross-compiling toolchains need packages from the foreign architecture to be installed alongside the native libraries. Cross-compiler packages have been available through the emdebian repositories but were always more of a hack. With multiarch, it is now possible to install packages from multiple architectures at once, so that cross-compilation toolchains can be realized in a proper manner and can therefore also enter the main archives. Besides creating multiarch enabled toolchains, he will also be responsible for making them build on the Debian build system, as cross-architecture dependencies are not yet supported. There is also the "Bootstrappable Debian" project by Patrick "P. J." McDermott, mentored by Wookey and Jonathan Austin. He will make a small set of source packages multiarch cross-compilable (using cross-compilers provided by Thibaut Girka) and add a Build-Depends-StageN header to critical packages so that they can be built with reduced build dependencies for breaking dependency cycles. He will also patch tools as necessary to recognize the new control header. And then there is my project: "Port bootstrap build-ordering tool" (Application). It is mentored by Wookey and Pietro Abate. In contrast to the other two, my output will be more on the meta-level as I will not modify any actual Debian package or patch Debian tools with more functionality. Instead the goal of this project is threefold:
  1. find the minimal set of source packages that have to be cross compiled
  2. help the user to find packages that are good candidates for breaking build dependency cycles through added staged build dependencies or by making them cross-compilable
  3. develop a tool that takes the information about packages that can be cross compiled or have staged build dependencies to output an ordering with which packages must be built to go from nothing to a full archive
More on that project in my follow-up post.

25 May 2012

Johannes Schauer: setting up mailman, postfix, lighttpd

I was worried about having to learn hundreds of configuration options to properly set up mailman, postfix and lighttpd on Debian Squeeze. It turned out that, except for lighttpd, it all works out of the box.
apt-get install postfix
When asked by debconf, I specified lists.mister-muffin.de as the fully qualified domain name.
apt-get install mailman
newlist mailman
The newlist command reminds me that I have to add its output to /etc/aliases. After doing so, I have to run:
newaliases
From now on, I can add any mailinglist by running newlist, editing /etc/aliases and running newaliases. Mailinglists can also be added through the mailman webinterface, but one still has to put the corresponding entries into /etc/aliases. The following is a lighttpd configuration that works out of the box with the default settings of mailman on Debian Squeeze. This was the only part that caused me some headaches.
server.modules += ("mod_alias", "mod_cgi", "mod_accesslog")
$HTTP["host"] == "lists.mister-muffin.de" {
    accesslog.filename = "/var/log/lighttpd/lists-access-log"
    alias.url += (
        "/cgi-bin/mailman/private/" => "/var/lib/mailman/archives/private/",
        "/cgi-bin/mailman/public/" => "/var/lib/mailman/archives/public/",
        "/pipermail/" => "/var/lib/mailman/archives/public/",
        "/cgi-bin/mailman/"=> "/var/lib/mailman/cgi-bin/",
        "/images/mailman/" => "/usr/share/images/mailman/",
    )
    cgi.assign = (
        "/admin" => "",
        "/admindb" => "",
        "/confirm" => "",
        "/create" => "",
        "/edithtml" => "",
        "/listinfo" => "",
        "/options" => "",
        "/private" => "",
        "/rmlist" => "",
        "/roster" => "",
        "/subscribe" => "")
}
server.document-root        = "/var/www"
server.errorlog             = "/var/log/lighttpd/error.log"
server.pid-file             = "/var/run/lighttpd.pid"
server.username             = "www-data"
server.groupname            = "www-data"
index-file.names            = ( "index.html" )
server.dir-listing          = "disable"
include_shell "/usr/share/lighttpd/create-mime.assign.pl"
As a bonus, I wanted to import my existing email exchange with my GSoC mentors into the mailinglist. First I was planning on manually sending the email messages to the list, but a much easier option is to just import them in mbox format. To extract all email messages, I first wrote the following python snippet:
import mailbox, itertools
box = mailbox.mbox('~/out')
for message in itertools.chain(mailbox.mbox('~/sent'), mailbox.Maildir('~/Mail/Web/', factory=None)):
    if (("wookey" in message.get('to', "").lower()
            or "wookey" in message.get('cc', "").lower()
            or "wookey" in message.get('from', "").lower()
            or "abate" in message.get('to', "").lower()
            or "abate" in message.get('cc', "").lower()
            or "abate" in message.get('from', "").lower())
            and not message['subject'][0] == '['
            and not message['subject'] == "multistrap"):
        box.add(message)
box.close()
It iterates through messages in my mbox and maildir mailboxes, filters them for emails by wookey or pietro, strips away some messages I found not to be relevant and then saves the filtered result into the mbox mailbox ~/out. It is important to specify factory=None for the Maildir parser, because it otherwise defaults to rfc822.Message instead of MaildirMessage. Also do not forget to call box.close(). I initially forgot to do so and ended up with missing messages in ~/out. I then copy the archive into place:
scp out lists.mister-muffin.de:/var/lib/mailman/archives/private/debian-bootstrap.mbox/debian-bootstrap.mbox
Another thing that initially caused me trouble was that the mbox didn't have the correct permissions due to the scp. Fixing them:
chown -R list:www-data /var/lib/mailman/archives/private/
chmod 664 /var/lib/mailman/archives/private/debian-bootstrap.mbox/debian-bootstrap.mbox
And update the mailman archive like this:
sudo -u list /usr/lib/mailman/bin/arch debian-bootstrap /var/lib/mailman/archives/private/debian-bootstrap.mbox/debian-bootstrap.mbox
Initially I was running the above command as root, which screws up permissions as well.

21 May 2012

Johannes Schauer: sisyphus wins ICRA 2012 VMAC

Sisyphus is a piece of software that I wrote as a member of a team from Jacobs University led by Prof. Dr. Andreas Nüchter. It managed to place our team first in this year's IEEE ICRA 2012 Virtual Manufacturing Automation Competition in all three rounds. The goal was to stack a given set of boxes of different length, height and width on a pallet in a way that achieves optimal volume utilization, center of mass and interlock of the boxes. Besides the cartesian placement of a box on the pallet, the only other degree of freedom was a 90° rotation of the box around a vertical axis. Since the arrangement of boxes into a three dimensional container is NP hard (three dimensional orthogonal knapsack), I settled on a heuristic for an approximate solution. The premise is that there are many boxes of equal height, which was the case in the test cases that were available from the 2011 VMAC. Given this premise, my heuristic was to arrange the boxes into layers of equal height and then stack these layers on top of each other. A set of boxes that would be left over, or too few from the start to form its own full layer, would then be stacked on top of the completed layers. There is a video of how this looks. My code is now online on github and it is even documented for everybody who is not me (or a potential future me of course). This blog post is about the "interesting" parts of sisyphus. You can read about the overall workings of it in the project's README. Python dict to XML and XML to Python dict The evaluation program for the challenge reads XML files, and the pallet size and the list of articles with their sizes are also given in XML format. So I had to have a way to easily read article information from XML and to easily dump my data into XML format. Luckily, none of the XML involved made use of XML attributes at all, so the only children a node had were other nodes. Thus, the whole XML file could be represented as a nested dictionary with keys being tagnames and the values being other dictionaries, lists, strings or integers. The code doing that uses xml.etree.ElementTree and turns out to be very simple:
from xml.etree import ElementTree

def xmltodict(element):
    def xmltodict_handler(parent_element):
        result = dict()
        for element in parent_element:
            if len(element):
                obj = xmltodict_handler(element)
            else:
                obj = element.text
            if result.get(element.tag):
                if hasattr(result[element.tag], "append"):
                    result[element.tag].append(obj)
                else:
                    result[element.tag] = [result[element.tag], obj]
            else:
                result[element.tag] = obj
        return result
    return {element.tag: xmltodict_handler(element)}

def dicttoxml(element):
    def dicttoxml_handler(result, key, value):
        if isinstance(value, list):
            for e in value:
                dicttoxml_handler(result, key, e)
        elif isinstance(value, basestring):
            elem = ElementTree.Element(key)
            elem.text = value
            result.append(elem)
        elif isinstance(value, int) or isinstance(value, float):
            elem = ElementTree.Element(key)
            elem.text = str(value)
            result.append(elem)
        elif value is None:
            result.append(ElementTree.Element(key))
        else:
            res = ElementTree.Element(key)
            for k, v in value.items():
                dicttoxml_handler(res, k, v)
            result.append(res)
    result = ElementTree.Element(element.keys()[0])
    for key, value in element[element.keys()[0]].items():
        dicttoxml_handler(result, key, value)
    return result

def xmlfiletodict(filename):
    return xmltodict(ElementTree.parse(filename).getroot())

def dicttoxmlfile(element, filename):
    ElementTree.ElementTree(dicttoxml(element)).write(filename)

def xmlstringtodict(xmlstring):
    return xmltodict(ElementTree.fromstring(xmlstring))

def dicttoxmlstring(element):
    return ElementTree.tostring(dicttoxml(element))
Let's try this out:
>>> from util import xmlstringtodict, dicttoxmlstring
>>> xmlstring = "<foo><bar>foobar</bar><baz><a>1</a><a>2</a></baz></foo>"
>>> xmldict = xmlstringtodict(xmlstring)
>>> print xmldict
{'foo': {'baz': {'a': ['1', '2']}, 'bar': 'foobar'}}
>>> dicttoxmlstring(xmldict)
'<foo><baz><a>1</a><a>2</a></baz><bar>foobar</bar></foo>'
The dict container doesn't preserve order, but as XML doesn't require that, this is not an issue either. Arranging items in layers Once I had decided to take the layered approach, the 3D knapsack problem boiled down to a 2D knapsack problem. The problem statement now was: how to best fit small rectangles into a big rectangle? I decided on a simple and fast approach as explained in Jake Gordon's blog article. There is a demo of his code and, should the site vanish from the net, the code is on github. This solution seemed to generate results that were "good enough" while being simple to implement and fast to execute. If you look very hard, you can still see some design similarities between my arrange_spread.py and his packer.js code. Jake Gordon got his idea from Jim Scott who wrote an article about arranging randomly sized lightmaps into a bigger texture. There is also an ActiveState Code recipe from 2005 which looks very similar to the code by Jake Gordon. The posts of Jake Gordon and Jim Scott explain the solution well, so I don't have to repeat it. Should the above resources go offline, I made backups of them here and here. There is also a backup of the ActiveState piece here. Spreading items out The algorithm above would cram all rectangles into a top-left position. As a result, there would mostly be space at the bottom and left edge of the available pallet surface. This is bad for two reasons:
  1. the mass is distributed unequally
  2. articles on the layer above that sit at the bottom or left edge are prone to overhang too much, so that they tumble down
Instead, all articles should be spread over the available pallet area, creating small gaps between them instead of big spaces at the pallet borders. Since articles were of different size, it was not clear to me from the start what "equal distribution" would even mean, because it was obvious that it was not as simple as making the space between all rectangles equal. The spacing had to be different between them to accommodate differently sized boxes. The solution I came up with makes use of the tree structure that was built by the algorithm that arranged the rectangles in the first place. The idea is to spread articles vertically first, recursively starting with the deepest nodes and spreading them out in their parent rectangle, and then to spread them horizontally, spreading the upper nodes first and recursively resizing and spreading child nodes. The whole recursion idea created problems of its own. One of the nicer recursive beauties is the following function:
def get_max_horiz_nodes(node):
    if node is None or not node['article']:
        return [], []
    elif node['down'] and node['down']['article']:
        rightbranch, sr = get_max_horiz_nodes(node['right'])
        rightbranch = [node] + rightbranch
        downbranch, sd = get_max_horiz_nodes(node['down'])
        ar = rightbranch[len(rightbranch)-1]['article']
        ad = downbranch[len(downbranch)-1]['article']
        if ar['x']+ar['width'] > ad['x']+ad['width']:
            return rightbranch, sr+[downbranch[0]]
        else:
            return downbranch, sd+[rightbranch[0]]
    else:
        rightbranch, short = get_max_horiz_nodes(node['right'])
        return [node] + rightbranch, short
get_max_horiz_nodes() traverses all branches of the tree that node has below itself and returns a tuple containing the list of nodes that form the branch that stretches out widest, plus the list of nodes that are in the other branches (which are shorter than the widest). Another interesting problem was how to decide on the gap between articles. This was interesting because the number resulting from subtracting the sum of the article lengths (or widths) from the available length (or width) was mostly not divisible by the number of gaps without leaving a rest. So there had to be an algorithm that gives me a list of integers, none of them differing by more than one from any other, that when summed up would give me the total amount of empty space. Or in other words: how to divide a number m into n integer pieces such that none of those integers differs by more than 1 from any other. Surprisingly, generating this list doesn't require any complex loop constructs:
>>> m = 108 # total amount
>>> n = 7 # number of pieces
>>> d,e = divmod(m, n)
>>> pieces = (e)*[(d+1)]+(n-e)*[d]
>>> print pieces
[16, 16, 16, 15, 15, 15, 15]
>>> sum(pieces) == m
True
>>> len(pieces) == n
True
You can test out the algorithms that arrange rectangles and spread them out by cloning the git and then running:
PYTHONPATH=. python legacy/arrange_spread.py
The results will be the svg files test1.svg and test2.svg, the latter showing the spread-out result. Here is an example of how the output looks (without the red border which is drawn to mark the pallet border). arrange_spread2.py contains an adaption of arrange_spread.py for the actual problem. Permutations without known length When creating a layer out of articles of the same height, there are four strategies that I can choose from. It is four because there are two methods that I can either use or not: I can rotate the articles by 90° by default or not, and I can rotate the pallet or not. So every time I build a new layer, there are those four options. Depending on which strategy I choose, there is a different number of possible leftover articles that did not fit into any layer. The number is different because each strategy is more or less efficient. To try out all combinations of possible layer arrangements, I have to walk through a tree where at each node I branch four times, once for each individual strategy. Individual neighboring nodes might be the same but this outcome is unlikely due to the path leading to those neighboring nodes being different. To simplify, let's name the four possible strategies for each layer 0, 1, 2 and 3. I now want an algorithm that enumerates all possible permutations of those four numbers for "some" length. This is similar to counting. And the itertools module comes with the product() method that nearly does what I want. For example, should I know that my tree does not become deeper than 8 (read: no more than eight layers will be built), then I can just run:
>>> for i in itertools.product([0,1,2,3], repeat=8):
...     print i
...
(0,0,0,0,0,0,0,0)
(0,0,0,0,0,0,0,1)
(0,0,0,0,0,0,0,2)
(0,0,0,0,0,0,0,3)
(0,0,0,0,0,0,1,0)
(0,0,0,0,0,0,1,1)
(0,0,0,0,0,0,1,2)
(0,0,0,0,0,0,1,3)
This would work if the number of layers created with each strategy were the same. But as each strategy behaves differently depending on the input, it cannot be known before actually trying out a sequence of strategies how many layers it will yield. The strategy (0,0,0,0,0,0,0,0) might create 7 layers, resulting in (0,0,0,0,0,0,0,1), (0,0,0,0,0,0,0,2) and (0,0,0,0,0,0,0,3) yielding the same output as only the first 7 strategies count. This would create duplicates which I should not waste cpu cycles on later. It might also be that (0,0,0,0,0,0,1,0) turns out to be a combination of strategies that creates more than 8 layers, in which case the whole thing fails. So what I need is a generator that gives me a new strategy depending on how often it is asked for one. It should dynamically extend the tree of possible permutations to accommodate any size. Since the tree will become humongous (4^11 = 4194304), already traversed nodes should automatically be cleaned up so that only the nodes that make up the current list of strategies stay in memory at any point in time. This sounded complicated, which made me even more surprised by how simple the solution was in the end. Here is a version of the algorithm that could easily be ported to C:
class Tree:
    def __init__(self, branch_factor):
        self.branch_factor = branch_factor
        self.root = {"value": None, "parent": None, "children": []}
        self.current = self.root

    def next(self):
        if not self.current["children"]:
            self.current["children"] = [{"value": val, "parent": self.current, "children": []}
                                        for val in range(self.branch_factor)]
        self.current = self.current["children"][0]
        return self.current["value"]

    def reset(self):
        if self.current["parent"]:
            self.current["parent"]["children"].pop(0)
        else:
            return False
        if self.current["parent"]["children"]:
            self.current = self.root
            return True
        else:
            self.current = self.current["parent"]
            return self.reset()

    def __str__(self):
        return str(self.root)
It would be used like this:
>>> tree = Tree(4)
>>> print tree.next(), tree.next(), tree.next()
>>> while tree.reset():
...     print tree.next(), tree.next(), tree.next()
Which would be equivalent to calling itertools.product([1,2,3,4], 3). The special part is, that in each iteration of the loop I can call tree.next() an arbitrary amount of times, just how much it is needed. Whenever I cannot generate an additional layer anymore, I can call tree.reset() to start a new permutation. For my code I used a python specific version which is a generator:
def product_varlength(branch_factor):
    root = {"value": None, "parent": None, "children": []}
    current = root
    while True:
        if not current["children"]:
            current["children"] = [{"value": val, "parent": current, "children": []}
                                   for val in range(branch_factor)]
        current = current["children"][0]
        if (yield current["value"]):
            while True:
                if current["parent"]:
                    current["parent"]["children"].pop(0)
                else:
                    return
                if current["parent"]["children"]:
                    current = root
                    break
                else:
                    current = current["parent"]
It is used like this:
it = product_varlength(4)
print it.next(), it.send(False), it.send(False)
while True:
    print it.send(True), it.send(False), it.send(False)
Again, the expression in the loop can contain any number of it.send(False) calls. The first it.send(True) tells the generator to do a reset.
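For example, with a branch factor of two and the length fixed to two, the generator enumerates the same sequences as itertools.product([0, 1], repeat=2), just on demand (a small demo using the Python 2 style it.next() of the code above):

it = product_varlength(2)
seq = [it.next(), it.send(False)]
while True:
    print(seq)  # [0, 0], then [0, 1], [1, 0], [1, 1]
    try:
        seq = [it.send(True), it.send(False)]
    except StopIteration:
        break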
